By 2016, the idea of data scientists informing major campaign decisions with data voodoo generated in windowless caves had become omnipresent in the endless discourse of American campaigns. Then, during the 2016 campaign, a small data analytics shop by the name of Cambridge Analytica took it a step further. As first reported by the Swiss magazine Das Magazin, the company's CEO, Alexander Nix, was fond of saying that his team had profiled the personality of every adult in the United States of America: 220 million people, truly a new milestone in the advent of Big Data. And not just any data, but psychometrics, the new Holy Grail of political analytics. Of course, accounts like this were soon discredited as overstated and inaccurate. But that did not mean that this type of data could not have value. So we set out to test it properly.
We have created scores (0 to 100) on ten issue clusters for every eligible voter in America. Each score is built from a cluster of survey questions on the topic, which are then modeled and projected onto the full voter file (specifically TargetSmart). We first set out to answer the question: if you were a researcher or practitioner, and it was early 2016, what data do you wish you had that would have allowed you to better forecast and understand the presidential, senatorial, and congressional elections? That is the easy part, because we can fit the data to the outcomes. The hard part is: if you are a researcher or practitioner, and it is early 2018, what data that is not currently in common use will allow you to better forecast and understand the senatorial and congressional elections to come? That is what we are trying to create with these psychometric variables (and others, coming soon).
Note on Methods: All clusters are scored using 20,000+ survey responses (and counting, as we add more responses each month). On the survey side, we model scores on clusters via a Bayesian latent variable model (developed by us on the back of the best academic research) of the components in any given cluster. The multi-item measures and the strong signal from 20,000 data points increase the accuracy of these models far beyond typical support models. In layman's terms: you cannot simply ask people if they are racist or like authoritarians, because there is strong social pressure to appear non-racist and to dislike authoritarians (well, at least in most places around the US). Instead, we ask 5 to 9 questions on each of these topics, which taken together describe a latent belief. Then, we model these question responses together to estimate the respondent's latent belief, even if they would have answered differently had we asked them directly. A second benefit of asking multiple clustered questions is that it averages out the measurement error that any one question may have.
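To make the measurement-error point concrete, here is a minimal simulation sketch. It is not our Bayesian latent variable model; it swaps in a plain average of items as a much simpler stand-in, and every number in it (2,000 simulated respondents, 7 items, the noise level) is an assumption chosen for illustration. It compares how well a single noisy question and a 7-question composite each track a simulated latent belief.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate(n_respondents=2000, n_items=7, noise_sd=1.0, seed=0):
    """Return (single-item r, composite r) against the simulated latent belief."""
    rng = random.Random(seed)
    # Simulated latent belief per respondent (illustrative, not survey data).
    latent = [rng.gauss(0, 1) for _ in range(n_respondents)]
    # Each question response = latent belief + independent measurement noise.
    items = [[t + rng.gauss(0, noise_sd) for _ in range(n_items)] for t in latent]
    single = [row[0] for row in items]
    composite = [sum(row) / n_items for row in items]
    return pearson(single, latent), pearson(composite, latent)


single_r, composite_r = simulate()
print(f"single question r = {single_r:.2f}, composite r = {composite_r:.2f}")
```

Under these assumptions the composite tracks the latent belief more closely than any one question, because the independent noise terms partly cancel when averaged.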
We then project the model coefficients onto the voter file, using, amongst other features, individual-level covariates: age, gender, race, education, party identification, household type, urbanicity based on cell-phone ambient data, and location. We correct the state-level distributions of partisans by calibrating to PredictWise Trump approval data, consisting of survey responses from close to 100,000 Americans. The end result is a continuous score for each cluster ranging from 0 to 100 for every American represented on the file, resulting in 257,799,100 records per variable.
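As a rough sketch of what projecting coefficients onto a voter file can look like: the coefficient values, the covariate names, the two sample records, and the logistic mapping onto the 0 to 100 range below are all hypothetical, chosen only to illustrate the mechanics. They are not our fitted model, and the calibration step is left out.

```python
from math import exp

# Hypothetical coefficients for one issue cluster (illustration only).
COEFS = {
    "intercept": -0.4,
    "age_over_65": -0.3,
    "college_degree": 0.5,
    "urban": 0.6,
}


def project_score(record, coefs=COEFS):
    """Map one voter-file record to a 0-100 score.

    Assumes a linear predictor passed through a logistic function;
    the real link function may differ.
    """
    linear = coefs["intercept"] + sum(
        weight * record.get(name, 0)
        for name, weight in coefs.items()
        if name != "intercept"
    )
    return 100.0 / (1.0 + exp(-linear))


# Two made-up voter-file records with binary covariates.
voter_file = [
    {"voter_id": "A1", "age_over_65": 1, "college_degree": 0, "urban": 0},
    {"voter_id": "B2", "age_over_65": 0, "college_degree": 1, "urban": 1},
]

scores = {v["voter_id"]: project_score(v) for v in voter_file}
for voter_id, score in scores.items():
    print(voter_id, round(score, 1))
```

The point of the design is that the survey only needs to estimate the coefficients once; every record on the file that carries the same covariates can then be scored without having been surveyed.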
1) Free Trade (Theoretical Range: 0=no support for free trade; 100=full support for free trade): we focus on the actual trade-offs of free trade where more/cheaper goods contrast with possible quality control, stress on the border, and jobs shifting from established to new industry.
2) Regulation (Theoretical Range: 0=no support for gvt. regulation; 100=full support for gvt. regulation): we run through a few key segments from environmental, workplace, financial, product safety, and food and drug safety regulation.
3) Safety Net for Poor (Theoretical Range: 0=no support for safety net; 100=full support for safety net): we focus on attitudes toward support for the poor when it comes to education, shelter, healthcare, food and job training.
4) Immigrants (Theoretical Range: 0=no support for immigrants; 100=full support for immigrants): we consider attitudes toward recent legal and illegal immigrants as well as refugees, and gauge perceptions of trade-offs between national security and job security on one hand, and open immigration on the other hand.
5) Military (Theoretical Range: 0=no support for expansive military role; 100=full support for expansive military role): we consider support for increasing the role and scope of the military in a number of areas, including securing US territory, protecting international trade routes, spreading democracy, and countering state-sponsored terrorism.
6) Guns (Theoretical Range: 0=no support for freedom to own/buy guns; 100=full support for freedom to own/buy guns): we consider attitudes toward the right of the mentally ill to buy guns, and toward gun purchases with or without a background check. We also gauge support for allowing citizens to own assault weapons and carry concealed weapons, and for the government to register gun owners.
7) Healthcare (Theoretical Range: 0=no support for gvt. provided healthcare; 100=full support for gvt. provided healthcare): we consider attitudes toward government-subsidized healthcare for the poor, those with pre-existing conditions, the elderly, mothers and newborns, as well as emergency treatments.
8) Taxes (Theoretical Range: 0=no support for raising taxes; 100=full support for raising taxes): we focus on attitudes toward taxation of the wealthy and the middle class, as well as preferences on inheritance taxes, corporate taxes, and capital gains taxes.
9) Women’s Healthcare (Theoretical Range: 0=no support for women’s healthcare/reproductive rights; 100=full support for women’s healthcare/reproductive rights): we consider attitudes towards women's and girls' healthcare, including preferences on allowing access to birth control without parental consent, comprehensive sexual education in schools, and access to abortions in a variety of circumstances.
10) Religious Freedom (Theoretical Range: 0=no support for religious freedom to deny equal rights; 100=full support for religious freedom to deny equal rights): we consider attitudes toward conflicts between discrimination and religious freedom. Specifically, we gauge preferences on allowing transgender people to access the bathroom of their choice, adoption rights for LGBT couples, the right of religious hospitals to deny certain procedures, the right to withhold products from potential customers whose sexual orientation offends the beliefs of the seller, and the right to dismiss employees with a non-conformist sexual orientation on the basis of religious beliefs.
While the end result, individual-level scores, can readily be used for targeting and messaging, the key differentiator of the PredictWise data from untested competitors (read: Cambridge Analytica) is validation. The most interesting question about the quality of the data is whether such data is actually useful. Here, we focus on our scores of preferences for conservative gun policies, and roll up our scores to the state level. The resulting map comes with high face validity. Overall, the country is very liberal on gun policies. Exceptions are the corridor around the northern national parks, specifically Montana, Idaho and North Dakota, as well as the upper Northeast, specifically Maine, and West Virginia. On the other hand, the constituents of California, Michigan, and Wisconsin favor stricter gun control laws.
To assess whether we capture meaningful variation, we obtained data on firearm mortality from the CDC (for 2016). There is good evidence that state-level firearm mortality correlates with state-level gun laws, and if public opinion is a predictor of local policy, we should see a positive relationship between state-level preferences for conservative gun policies and firearm mortality. This is exactly what we find, visualized in the graph below.
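The logic of this check can be sketched in a few lines. Everything below is placeholder data: the state codes "AA", "BB", "CC", the individual scores, and the mortality rates are invented to show the mechanics, and are neither our scores nor CDC figures. The sketch rolls individual scores up to state means and then fits a simple least-squares slope of mortality on those means.

```python
from collections import defaultdict


def state_means(records):
    """Average individual-level scores within each state."""
    totals = defaultdict(lambda: [0.0, 0])
    for state, score in records:
        totals[state][0] += score
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


# Placeholder (state, gun-policy score) pairs; NOT real voter-file scores.
records = [
    ("AA", 70.0), ("AA", 80.0),
    ("BB", 40.0), ("BB", 50.0),
    ("CC", 55.0), ("CC", 65.0),
]
# Placeholder firearm mortality rates per state; NOT CDC data.
mortality = {"AA": 18.0, "BB": 8.0, "CC": 12.0}

means = state_means(records)
states = sorted(means)
slope = ols_slope([means[s] for s in states], [mortality[s] for s in states])
print(means, round(slope, 3))
```

A positive slope is what the argument above predicts; with real data one would also want to look at the strength of the relationship, not only its sign.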
There is more validation work to be done in understanding these issue variables beyond showing they would be helpful in predicting real-world outcomes. First, we are exploring how these variables can help refine targets for likely marginal voters, in terms of both persuasion and likelihood to vote. We have models for both, but do any of these variables make them better? Second, we will explore how these variables can help generate appropriate messaging for target populations, testing whether advertisers should tailor messages based on these latent variables.