Survey modelling

Adjusting for the Lizardman constant


Tricks of particular use in modelling survey data: hierarchical models to adjust for issues such as non-random sampling, and the many difficulties of eliciting human preferences by asking humans about them. A grab bag of weird data types, elicitation problems, and sampling biases.

Sampling challenges

What is the Lizardman constant? It is the observation that roughly 4% of respondents will claim on a survey that their president is an alien lizard monster. What does that say about the data in general?
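
If we posit that a fixed fraction of respondents answer more or less at random, we can back out a corrected prevalence from the observed one. A minimal sketch (the 4% noise rate and the coin-flip noise model are illustrative assumptions, not estimates):

```python
def corrected_prevalence(p_obs, eps=0.04, p_noise=0.5):
    """Back out the sincere prevalence of a survey response, assuming a
    fraction `eps` of respondents answer arbitrarily, saying yes with
    probability `p_noise`, so that

        p_obs = (1 - eps) * p_true + eps * p_noise
    """
    p_true = (p_obs - eps * p_noise) / (1 - eps)
    return min(1.0, max(0.0, p_true))  # clip to a valid probability

# An observed 5% endorsement of an absurd item is consistent with only
# ~3% sincere belief, and an observed 2% with none at all:
print(corrected_prevalence(0.05))  # 0.03125
print(corrected_prevalence(0.02))  # 0.0
```

The arithmetic is trivial; the point is that observed proportions near the noise floor tell us almost nothing about true prevalence.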

Post-stratification

Reweighting your data to correct for various types of remediable sampling bias. See post-stratification.
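
The core move, in a minimal sketch: estimate the response within each stratum of the sample, then average the strata using known population shares instead of the biased sample shares. The data and shares below are invented; with realistically many cells this gets sparse, which is where multilevel regression and poststratification (MRP, per e.g. Gao et al. 2019 below) replaces raw cell means with model predictions.

```python
import pandas as pd

# Toy sample, skewed towards the middle age group relative to the
# population shares below.
sample = pd.DataFrame({
    "age": ["18-34"] * 2 + ["35-64"] * 5 + ["65+"] * 3,
    "y":   [1, 0] + [1, 1, 0, 1, 1] + [0, 0, 0],
})

# Known population share of each cell, e.g. from a census.
population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

cell_means = sample.groupby("age")["y"].mean()
naive = sample["y"].mean()
poststratified = sum(cell_means[g] * w for g, w in population_share.items())

print(f"naive estimate:          {naive:.3f}")           # 0.500
print(f"poststratified estimate: {poststratified:.3f}")  # 0.550
```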

Cluster randomized trials

Melanie Bell, Cluster Randomized Trials

Cluster randomized trials (CRTs) are studies where groups of people, rather than individuals, are randomly allocated to intervention or control. While these types of designs can be appropriate and useful for many research settings, care must be taken to correctly design and analyze them. This talk will give an overview of cluster trials, and various methodological research projects on cluster trials that I have undertaken: designing CRTs, the use of GEE with a small number of clusters, handling missing data in CRTs, and analysis using mixed models. I will demonstrate methods with an example from a recently completed trial on reducing cardiovascular risk among Mexican diabetics.
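
A recurring design point: because outcomes within a cluster are correlated, a CRT yields much less information per participant than an individually randomized trial. The standard design effect for equal cluster sizes, DEFF = 1 + (m − 1) × ICC, makes this concrete; the numbers below are illustrative only:

```python
def design_effect(cluster_size, icc):
    """Design effect for equal cluster sizes: DEFF = 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_total, cluster_size, icc):
    """Sample size of an individually randomized trial carrying the
    same information as this cluster randomized one."""
    return n_total / design_effect(cluster_size, icc)

# 20 clinics of 50 patients each: even a modest intra-cluster
# correlation shrinks the effective sample size dramatically.
print(design_effect(50, 0.05))                # 3.45
print(effective_sample_size(1000, 50, 0.05))  # ~290
```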

Ordinal data

Ordinal data is what we usually get when we ask people things: think star ratings, or Likert scales. Ordinal models are how we handle it.
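
For concreteness, a minimal sketch of fitting a proportional-odds (cumulative logit) model to simulated star ratings with statsmodels' OrderedModel (assuming a recent statsmodels, ≥ 0.13); all the data-generating parameters here are arbitrary:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 500

# Simulate 1-5 star ratings from a latent score plus logistic noise.
x = rng.normal(size=n)
latent = 0.8 * x + rng.logistic(size=n)
rating = pd.Series(pd.cut(
    latent,
    bins=[-np.inf, -2, -0.5, 0.5, 2, np.inf],
    labels=["1 star", "2", "3", "4", "5 stars"],
))  # an ordered categorical, as OrderedModel expects

# Proportional-odds model: one slope for x, plus ordered cutpoints
# between adjacent rating categories.
res = OrderedModel(rating, x[:, None], distr="logit").fit(
    method="bfgs", disp=False)
print(res.summary())
```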

sjPlot, by Daniel Lüdecke, is a handy R package for exploratory plotting of Likert-type responses in social survey data.

Confounding and observational studies

See Causal graphical models.

There is some interesting crossover with clinical trial theory.

It is a commonly held belief that clinical trials, to provide treatment effects that are generalizable to a population, must use a sample that reflects that population’s characteristics. The confusion stems from the fact that if one were interested in estimating an average outcome for patients given treatment A, one would need a random sample from the target population. But clinical trials are not designed to estimate absolutes; they are designed to estimate differences as discussed further here. These differences, when measured on a scale for which treatment differences are allowed mathematically to be constant (e.g., difference in means, odds ratios, hazard ratios), show remarkable constancy as judged by a large number of published forest plots. What would make a treatment estimate (relative efficacy) not be transportable to another population? A requirement for non-generalizability is the existence of interactions with treatment such that the interacting factors have a distribution in the sample that is much different from the distribution in the population.

A related problem is the issue of overlap in observational studies. Researchers are taught that non-overlap makes observational treatment comparisons impossible. This is only true when the characteristic whose distributions don’t overlap between treatment groups interacts with treatment. The purpose of this article is to explore interactions in these contexts.

As a side note, if there is an interaction between treatment and a covariate, standard propensity score analysis will completely miss it.
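
A toy numerical illustration of the interaction point (mine, not from the passage above): in a logistic model with no treatment-by-covariate interaction, the per-stratum treatment effect on the log-odds scale is identical in every stratum, so it transports to any covariate distribution; add an interaction, and any averaged effect depends on where the covariate mass sits in the sample at hand.

```python
import numpy as np

def log_odds(p):
    return np.log(p / (1 - p))

def risk(treat, x, interaction=0.0):
    """P(outcome) under a toy logistic model: baseline -1, treatment
    effect 1.0, covariate effect 0.5, optional treatment-by-x term."""
    logit = -1.0 + 1.0 * treat + 0.5 * x + interaction * treat * x
    return 1 / (1 + np.exp(-logit))

for interaction in (0.0, -1.0):
    # Conditional log odds ratio for treatment within each x stratum.
    effects = [
        round(float(log_odds(risk(1, x, interaction))
                    - log_odds(risk(0, x, interaction))), 3)
        for x in (0.0, 1.0, 2.0)
    ]
    print(f"interaction={interaction}: per-stratum effects {effects}")

# interaction=0.0:  [1.0, 1.0, 1.0]  -- constant, hence transportable
# interaction=-1.0: [1.0, 0.0, -1.0] -- any averaged effect now depends
# on the distribution of x in the sample
```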

Graph sampling

See inference on social graphs.

Data sets

Parsing SDA Pages

Kieran Healy, describing the data wrangling behind the gssr package:

SDA is a suite of software developed at Berkeley for the web-based analysis of survey data. The Berkeley SDA archive lets you run various kinds of analyses on a number of public datasets, such as the General Social Survey. It also provides consistently-formatted HTML versions of the codebooks for the surveys it hosts. This is very convenient! For the gssr package, I wanted to include material from the codebooks as tibbles or data frames that would be accessible inside an R session. Processing the official codebook from its native PDF state into a data frame is, though technically possible, a rather off-putting prospect. But SDA has done most of the work already by making the pages available in HTML. I scraped the codebook pages from them instead. This post contains the code I used to do that.
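
For flavour, a rough Python analogue of that scraping strategy (the original works in R); the URL stub and the assumption that codebook entries sit in two-column table rows are mine, not SDA's documented layout:

```python
import requests
from bs4 import BeautifulSoup

def scrape_codebook_page(url):
    """Fetch one SDA codebook page and pull out plausible
    (variable, label) pairs from its two-column table rows."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if len(cells) >= 2:
                rows.append((cells[0], cells[1]))
    return rows

# e.g. scrape_codebook_page("https://sda.berkeley.edu/...")  # hypothetical URL
```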

Achlioptas, Dimitris, Aaron Clauset, David Kempe, and Cristopher Moore. 2005. “On the Bias of Traceroute Sampling: Or, Power-Law Degree Distributions in Regular Graphs.” In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, 694–703. STOC ’05. New York, NY, USA: ACM. https://doi.org/10.1145/1060590.1060693.

Bareinboim, Elias, and Judea Pearl. 2016. “Causal Inference and the Data-Fusion Problem.” Proceedings of the National Academy of Sciences 113 (27): 7345–52. https://doi.org/10.1073/pnas.1510507113.

Bareinboim, Elias, Jin Tian, and Judea Pearl. 2014. “Recovering from Selection Bias in Causal and Statistical Inference.” In AAAI, 2410–6. http://ftp.cs.ucla.edu/pub/stat_ser/r425.pdf.

Bond, Robert M., Christopher J. Fariss, Jason J. Jones, Adam D. I. Kramer, Cameron Marlow, Jaime E. Settle, and James H. Fowler. 2012. “A 61-Million-Person Experiment in Social Influence and Political Mobilization.” Nature 489 (7415): 295–98. https://doi.org/10.1038/nature11421.

Broockman, David E., Joshua Kalla, and Jasjeet S. Sekhon. 2016. “The Design of Field Experiments with Survey Outcomes: A Framework for Selecting More Efficient, Robust, and Ethical Designs.” SSRN Scholarly Paper ID 2742869. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=2742869.

Gao, Yuxiang, Lauren Kennedy, Daniel Simpson, and Andrew Gelman. 2019. “Improving Multilevel Regression and Poststratification with Structured Priors.” August 19, 2019. http://arxiv.org/abs/1908.06716.

Gelman, Andrew. 2007. “Struggles with Survey Weighting and Regression Modeling.” Statistical Science 22 (2): 153–64. https://doi.org/10.1214/088342306000000691.

Gelman, Andrew, and John B. Carlin. 2000. “Poststratification and Weighting Adjustments.” Wiley. http://www.stat.columbia.edu/~gelman/research/published/handbook5.pdf.

Ghitza, Yair, and Andrew Gelman. 2013. “Deep Interactions with MRP: Election Turnout and Voting Patterns Among Small Electoral Subgroups.” American Journal of Political Science 57 (3): 762–76. https://doi.org/10.1111/ajps.12004.

Hart, Einav, Eric VanEpps, and Maurice E. Schweitzer. 2019. “I Didn’t Want to Offend You: The Cost of Avoiding Sensitive Questions.” SSRN Scholarly Paper ID 3437468. Rochester, NY: Social Science Research Network. https://papers.ssrn.com/abstract=3437468.

Kennedy, Edward H., Jacqueline A. Mauro, Michael J. Daniels, Natalie Burns, and Dylan S. Small. 2019. “Handling Missing Data in Instrumental Variable Methods for Causal Inference.” Annual Review of Statistics and Its Application 6 (1): 125–48. https://doi.org/10.1146/annurev-statistics-031017-100353.

Kohler, Ulrich, Frauke Kreuter, and Elizabeth A. Stuart. 2019. “Nonprobability Sampling and Causal Analysis.” Annual Review of Statistics and Its Application 6 (1): 149–72. https://doi.org/10.1146/annurev-statistics-030718-104951.

Kong, Yuqing. 2019. “Dominantly Truthful Multi-Task Peer Prediction with a Constant Number of Tasks.” November 1, 2019. http://arxiv.org/abs/1911.00272.

Krivitsky, Pavel N., and Martina Morris. 2017. “Inference for Social Network Models from Egocentrically Sampled Data, with Application to Understanding Persistent Racial Disparities in HIV Prevalence in the US.” The Annals of Applied Statistics 11 (1): 427–55. https://doi.org/10.1214/16-AOAS1010.

Lerman, Kristina. 2017. “Computational Social Scientist Beware: Simpson’s Paradox in Behavioral Data.” October 24, 2017. http://arxiv.org/abs/1710.08615.

Little, R. J. A. 1993. “Post-Stratification: A Modeler’s Perspective.” Journal of the American Statistical Association 88 (423): 1001–12. https://doi.org/10.1080/01621459.1993.10476368.

Little, Roderick J. A. 1991. “Inference with Survey Weights.” Journal of Official Statistics 7 (4): 405.

Prelec, Dražen, H. Sebastian Seung, and John McCoy. 2017. “A Solution to the Single-Question Crowd Wisdom Problem.” Nature 541 (7638): 532–35. https://doi.org/10.1038/nature21054.

Rubin, Donald B, and Richard P Waterman. 2006. “Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology.” Statistical Science 21 (2): 206–22. https://doi.org/10.1214/088342306000000259.

Shalizi, Cosma Rohilla, and Edward McFowland III. 2016. “Controlling for Latent Homophily in Social Networks Through Inferring Latent Locations.” July 22, 2016. http://arxiv.org/abs/1607.06565.

Shalizi, Cosma Rohilla, and Andrew C. Thomas. 2011. “Homophily and Contagion Are Generically Confounded in Observational Social Network Studies.” Sociological Methods & Research 40 (2): 211–39. https://doi.org/10.1177/0049124111404820.

Yadav, Pranjul, Lisiane Prunelli, Alexander Hoff, Michael Steinbach, Bonnie Westra, Vipin Kumar, and Gyorgy Simon. 2016. “Causal Inference in Observational Data.” November 14, 2016. http://arxiv.org/abs/1611.04660.