Identifying the most and least politically tolerant county in the US – PredictWise analytics and reporting by The Atlantic
Today, The Atlantic published a story identifying the least and most politically tolerant counties in the US – and, along the way, ranking every county on a political tolerance scale, using analytics by PredictWise. The story is great for multiple reasons (yes, I am biased), not the least of which is the combination of rigorous quantitative analysis with qualitative methods, or what I would call data-driven ethnography. I am not aware of many journalists who tackle incredibly complex problems with such a structured methodological approach. It opens a healthy debate – and beats every New York Times Ohio diner story!

Of course, every exhaustive ranking invites numerous debates, and some are more fun than others. Residents of Suffolk County might feel the county's top spot is undeserved: after all, many residents there have conversations with folks, even friends, identifying with the out party. Here is an example: Susy (name changed) is outraged. She has always prided herself on fostering friendships across the aisle – the friend she gets breakfast with once a week is a Republican, and while the two disagree politically, Susy loves her dearly. Her question: does the model pick this up? The answer: no, it does not, it cannot, and it never will.
Every model is a representation of some unknown outcome that offers estimates of that outcome. Frustratingly, error is inherent in all models, and not knowing the true outcome (in this case, political tolerance) means we can never know the true error. Using analytics, surveys and Big Data to reflect on microtrends should instead be a virtuous cycle in which researchers lay bare their assumptions and decisions, and other researchers check, correct and ultimately improve upon the first model. In this spirit, here is a more detailed breakdown of the PredictWise analytics, including transparency about decision rules, data, and more.
The Survey Instrument
First, PredictWise collected 2,000 survey responses across the country, using a sampling technique called Random Device Engagement (RDE). For more background, read here, but the gist of it is that we use advertising networks on mobile devices to engage random people where they are to answer our surveys. RDE gives us good coverage – a pool of 7,000,000 potential respondents in the US (much deeper than most panels) – and allows us to collect ambient data on top of survey responses, most interestingly a rich history of highly precise device-based geo-location coordinates. We then surveyed our unique respondents on 14 questions – the full survey instrument is below:
- How would you react if a member of your immediate family married a Democrat?
- How would you react if a member of your immediate family married a Republican?
- How well does the term 'Patriotic' describe Democrats?
- How well does the term 'Selfish' describe Democrats?
- How well does the term 'Willing to compromise' describe Democrats?
- How well does the term 'Compassionate' describe Democrats?
- How well does the term 'Patriotic' describe Republicans?
- How well does the term 'Selfish' describe Republicans?
- How well does the term 'Willing to compromise' describe Republicans?
- How well does the term 'Compassionate' describe Republicans?
- How do you feel about the Republican Party today?
- How do you feel about the Democratic Party today?
- How do you feel about Democratic voters today?
- How do you feel about Republican voters today?
In addition, we collected demographic information and partisan identification, and matched respondents back to the full voter file using our history of geo-coordinates – taking as the home latitude-longitude pair the modal data point recorded between 7 pm and 5 am local time, and relying on address-plus-demographics fuzzy matching. Finally, we had to drop all self-declared Independents – after all, how do you determine tolerance for the out party of somebody who has no partisan affiliation? (We do, however, count as partisans those self-declared Independents who consistently score one party very low and the other party very high on the Feeling Thermometers we collected.)
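The home-location step can be illustrated with a minimal sketch: given a list of timestamped location pings, take the modal rounded coordinate pair observed between 7 pm and 5 am local time. The data shape and the rounding rule used to bucket nearby coordinates are my assumptions, not PredictWise's production pipeline:

```python
from collections import Counter
from datetime import datetime

def infer_home_location(pings, precision=3):
    """Infer a device's home lat-lon as the modal (rounded) coordinate
    pair observed between 7 pm and 5 am local time.

    `pings` is a list of (local_datetime, lat, lon) tuples; `precision`
    is the decimal rounding used to bucket nearby coordinates. Both are
    illustrative assumptions.
    """
    nighttime = [
        (round(lat, precision), round(lon, precision))
        for ts, lat, lon in pings
        if ts.hour >= 19 or ts.hour < 5
    ]
    if not nighttime:
        return None
    # The mode of the nighttime buckets is taken as the home location.
    return Counter(nighttime).most_common(1)[0][0]

pings = [
    (datetime(2019, 3, 1, 22, 15), 40.7128, -74.0060),  # night, home
    (datetime(2019, 3, 2, 23, 40), 40.7129, -74.0061),  # night, home
    (datetime(2019, 3, 3, 2, 5),   40.7131, -74.0059),  # night, home
    (datetime(2019, 3, 2, 14, 0),  40.7580, -73.9855),  # daytime, work
]
print(infer_home_location(pings))  # → (40.713, -74.006)
```

The daytime ping is discarded, so the work location never competes with home even if it is visited often during business hours.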
Methodology: High Level
PredictWise routinely relies on highly evolved variants of Mr.P. (spelled out: multilevel regression and post-stratification). The method allows us to take relatively low-N survey data and derive small-area estimates. In fact, we have driven the methodological debate around Mr.P. for years, and published extensively on it (here, here, here). In short, we model the outcome of interest (here: political tolerance) based on urbanicity of the home address, age, gender, education, household composition, race, party affiliation, and two variables we use to describe the neighborhood: age variation and variation in partisan identification at the census block where the individual resides. Our model is a multiplicative multi-level model producing estimates of political tolerance for millions of demographic combinations. These models are powerful because every single response is used to train all parameters. So, the response of a married Republican over 55 with a college education, living in a suburban neighborhood with a high mix of partisan affiliation and age (measured at the census block), increases precision in all of those parameters simultaneously! Our full Mr.P. model is complex, evolved over years, and spelled out at the bottom for methods geeks (any/all feedback welcome!).
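Mr.P. itself is far more involved than a snippet can capture, but the core partial-pooling intuition – small groups borrow strength from the overall mean, while well-measured groups keep their own signal – can be illustrated with a toy precision-weighted shrinkage estimator. This is a sketch of the idea only, not the actual multiplicative multi-level model:

```python
def partial_pool(group_means, group_ns, grand_mean, tau2, sigma2):
    """Shrink each group's raw mean toward the grand mean, with
    smaller groups shrunk more -- the core intuition behind
    multilevel (Mr.P.-style) small-area estimation.

    tau2 is the between-group variance, sigma2 the within-group
    variance; both would be learned from the data in a real model.
    """
    pooled = []
    for mean, n in zip(group_means, group_ns):
        # Precision-weighted compromise between group mean and grand mean.
        w = tau2 / (tau2 + sigma2 / n)
        pooled.append(w * mean + (1 - w) * grand_mean)
    return pooled

# A county with 4 respondents is pulled strongly toward the national
# mean; one with 400 respondents mostly keeps its own raw estimate.
print(partial_pool([0.9, 0.9], [4, 400], 0.5, tau2=0.01, sigma2=0.25))
```

Both counties have the same raw mean (0.9), but the small county's pooled estimate lands much closer to the national 0.5.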
The last step of this kind of analysis means weighting the estimates for all (in this case, many millions of) demographics by the fraction of each demographic in the target population, and that is the crux: we have to identify the fully interacted count table at the county level. So we have to know: how many white, married Republicans over 55 with a college education, living in suburban neighborhoods with a high mix of partisan affiliation and age, are there, really? That data is unknown. We do our best: we use an augmented full commercial voter file acquired through TargetSmart, and probabilistically impute the many missing records with records from the ACS at the census-block-group level. There is one further difficulty: we have to identify partisans. It is crucial that we identify partisans in the model, as there is strong support for asymmetric polarization – members of different parties feeling differently about political tolerance – so not including it in the model introduces error (which we call poor model fit). But including it in the model means identifying the exact number of partisans by county – a notoriously difficult task. In short: it is a trade-off. We follow a pre-defined decision rule to identify partisans: 1) rely on partisan registration; 2) rely on primary vote where party registration is not available; 3) rely on voter-file models calibrated such that the national distribution of Republicans and Democrats matches the national Gallup average. Can mistakes happen? Yes, we do suspect some of the sharp state differences can be artifacts of how party data is collected at the state level. Counter-argument? We only run into these differences in a handful of states, and the most blatant examples, South Carolina and Florida, are states in which we find partisans to be very insulated in their neighborhoods by age and partisan affiliation.
So, we decided to let the data speak for itself, instead of smoothing (read: fudging) our results ex post to avoid controversy.
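The three-step partisan decision rule can be sketched as a simple cascade. The record keys and the score threshold below are illustrative assumptions, not the actual voter-file schema:

```python
def classify_partisan(voter):
    """Apply a three-step cascade to a voter record: registration,
    then primary vote, then a calibrated voter-file model score.
    Dict keys and the 0.5 cutoff are hypothetical placeholders.
    """
    # 1) Party registration, where the state collects it.
    if voter.get("party_registration") in ("D", "R"):
        return voter["party_registration"]
    # 2) Most recent primary participation otherwise.
    if voter.get("primary_party") in ("D", "R"):
        return voter["primary_party"]
    # 3) Fall back to a voter-file partisanship score, which in the
    #    real pipeline is calibrated so national D/R shares match the
    #    Gallup average.
    score = voter.get("partisanship_score")
    if score is None:
        return None
    return "D" if score >= 0.5 else "R"

print(classify_partisan({"party_registration": "D"}))   # → D
print(classify_partisan({"primary_party": "R"}))        # → R
print(classify_partisan({"partisanship_score": 0.72}))  # → D
```

A cascade like this makes the ordering of evidence explicit, which is exactly where state-level artifacts can creep in: states without party registration fall through to noisier steps.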
The Bottom Line: Replicate, Debunk, Further the Debate (But Be Transparent!)
PredictWise prides itself on transparency. I am beyond happy to see the kind of healthy and lively debate The Atlantic story created – that is a good thing! More data and better models will always improve existing research, and in that sense, I cannot wait for these results to be replicated, shared, debunked and improved. Of course, the perfect solution is a survey answered by hundreds of millions of Americans – but that remains elusive. Instead, a combination of novel survey methods, analytics and computing – made possible by recent advancements in statistics and computer science – allows us to get some handle on geo-spatial variation in phenomena like political tolerance, which is more important than ever in today's political climate. In that sense, let this be the first of many analyses on this subject. And to all applied researchers: happy replicating, improving, debunking!
Here is our variant of Mr.P. used in this analysis (warning: it gets technical). We model the survey data with a Bayesian quasi-IRT model, assuming every outcome we are interested in can be explained by an item-specific discrimination parameter (or slope), a set of item- and category-specific difficulty parameters (or intercepts), plus a single latent trait (read: our respondents' standing on the tolerance dimension). In essence, these are sequential Bayesian ordered logits assuming the same underlying latent trait eta. Here is the Bayesian spirit of the models applied to two outcomes – whether you are OK with your offspring marrying an out-party spouse, and how selfish you think members of the out party are:
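In notation, the two ordered logits described above can be sketched as follows. The exact parameterization is an assumption on my part: $\alpha_j$ is the item discrimination, $\kappa_{j,k}$ the item- and category-specific difficulty, and $\eta_i$ the latent tolerance of respondent $i$:

```latex
% Sketch of the two ordered logits; the graded-response
% parameterization is assumed, not taken verbatim from the model.
\Pr\left(y_{i,\mathrm{marry}} \ge k\right)
  = \mathrm{logit}^{-1}\!\left(\alpha_{\mathrm{marry}}\,\eta_i
      - \kappa_{\mathrm{marry},k}\right)

\Pr\left(y_{i,\mathrm{selfish}} \ge k\right)
  = \mathrm{logit}^{-1}\!\left(\alpha_{\mathrm{selfish}}\,\eta_i
      - \kappa_{\mathrm{selfish},k}\right)
```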
The latent trait itself is defined by a set of individual-level and census-block-level predictors: urbanicity based on home address, age, gender, education, household demographics, race, party affiliation, and two variables we use to describe the neighborhood: age variation and variation in partisan identification at the census block where the individual resides (see below). All predictors are themselves drawn from a prior normal distribution with mean 0 and variance learned from the data.
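Written out as an additive random-effects sketch (the additive form and index notation are my assumptions; the text describes the full model as multiplicative, so treat this as notation only):

```latex
% eta_i as a sum of group-level effects, one per predictor;
% u[i], a[i], ... index respondent i's category on each predictor.
\eta_i = \beta^{\mathrm{urb}}_{u[i]} + \beta^{\mathrm{age}}_{a[i]}
       + \beta^{\mathrm{gender}}_{g[i]} + \beta^{\mathrm{educ}}_{e[i]}
       + \beta^{\mathrm{hh}}_{h[i]} + \beta^{\mathrm{race}}_{r[i]}
       + \beta^{\mathrm{party}}_{p[i]}
       + \beta^{\mathrm{agevar}}_{v[i]} + \beta^{\mathrm{pidvar}}_{w[i]},
\qquad
\beta^{(\cdot)}_j \sim \mathcal{N}\!\left(0, \sigma^2_{(\cdot)}\right)
```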
Based on this equation, we create outcomes for each of the demographics we care about – notably the full set of interactions of all the variables above (so, millions of demographics!). The last step of this kind of analysis means weighting the estimates by the fraction of each demographic of interest in the target population (see above for more discussion of the projection, or target, space).
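That weighting step amounts to a population-weighted average over demographic cells. A minimal sketch with two made-up cells:

```python
def poststratify(cell_estimates, cell_counts):
    """Collapse cell-level tolerance estimates to one county figure by
    weighting each demographic cell by its share of the county's
    population. Cell keys and counts are illustrative.
    """
    total = sum(cell_counts.values())
    return sum(
        cell_estimates[cell] * count / total
        for cell, count in cell_counts.items()
    )

# Two toy cells in one county: cell A is 75% of the population.
estimates = {"A": 0.8, "B": 0.4}
counts = {"A": 300, "B": 100}
print(poststratify(estimates, counts))  # approximately 0.7
```

The hard part, as discussed above, is not this arithmetic but knowing the counts: the fully interacted county-level table has to be assembled from the voter file plus probabilistic ACS imputation.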