By 2016, the idea of data scientists informing major campaign decisions with data voodoo generated in windowless caves had become omnipresent in the endless discourse about American campaigns. Then, during the 2016 campaign, a small data analytics shop by the name of Cambridge Analytica took it a step further. As first reported by the Swiss magazine Das Magazin, the company's CEO, Alexander Nix, was fond of saying that his team had profiled the personality of every adult in the United States of America: 220 million people, truly a new milestone in the advent of Big Data. And not just any data, but psychometrics, the new Holy Grail of political analytics. Of course, accounts like this were soon discredited as overstated and inaccurate. But that did not mean that this type of data could not have value. So we set out to test it the right way.

We have created scores (0 to 100) on nine psychometric variables for every eligible voter in America. Each score is built from a cluster of survey questions on the topic, which are then modeled and projected onto the full voter file (specifically Target Smart). We first set out to answer the question: if you were a researcher or practitioner, and it was early 2016, what data do you wish you had that would have allowed you to both better forecast and better understand the presidential, senatorial, and congressional elections? That is the easy part, because we can fit the data to the outcomes. The hard part is: if you are a researcher or practitioner, and it is early 2018, what data that is not yet commonly available will allow you to better forecast and understand the senatorial and congressional elections to come? That is what we are trying to create with these psychometric variables (and others, coming soon).

Note on Methods: All clusters are scored using 20,000+ survey responses (and counting, as we add more responses each month). On the survey side, we model the score for each cluster via a Bayesian latent variable model (developed by us on the back of the best academic research) of the components in that cluster. The multi-item measures and the strong signal from 20,000 data points increase the accuracy of these models far beyond typical support models. In layman’s terms: you cannot simply ask people if they are racist or like authoritarians, because there is strong social desirability bias toward not appearing racist and toward disliking authoritarians (well, at least in most places around the US). Instead, we ask 5 to 9 questions on each of these topics, which taken together describe a latent belief. Then, we model these question responses together to estimate the respondent’s latent belief, even if they would have answered differently had we asked them directly. A second benefit of asking a cluster of multiple questions is that it averages out the measurement error that any one question may have.
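
The measurement-error point can be illustrated with a small simulation. This is only a sketch: it is not the Bayesian latent variable model described above, but swaps in a plain average of items, and every name and parameter in it (simulate_cluster, seven items, a noise standard deviation of 1.5) is an assumption chosen for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_cluster(n_respondents=5000, n_items=7, noise_sd=1.5, seed=0):
    """Compare how well one item vs. the item average tracks a latent belief."""
    rng = random.Random(seed)
    # Hypothetical latent belief for each respondent (standard normal).
    latent = [rng.gauss(0, 1) for _ in range(n_respondents)]
    # Each question response = latent belief + independent measurement error.
    responses = [
        [theta + rng.gauss(0, noise_sd) for _ in range(n_items)]
        for theta in latent
    ]
    single_item = [row[0] for row in responses]
    cluster_mean = [statistics.fmean(row) for row in responses]
    return pearson(latent, single_item), pearson(latent, cluster_mean)


r_single, r_cluster = simulate_cluster()
print(f"single item vs latent:  r = {r_single:.2f}")
print(f"7-item mean vs latent:  r = {r_cluster:.2f}")
```

Under these assumed settings, the theoretical correlation with the latent belief is about 0.55 for a single item and about 0.87 for the seven-item average, because averaging seven independent errors cuts the error variance by a factor of seven. A full latent variable model would additionally weight items by how informative they are.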

We then project the model coefficients onto the voter file, using, amongst other features, individual-level covariates: age, gender, race, education, party identification, household type, urbanicity based on cell-phone ambient data, and location. We correct the state-level distributions of partisans by calibrating to PredictWise Trump approval data, consisting of survey responses from close to 100,000 Americans. The end result is a continuous score for each cluster ranging from 0 to 100 for every American represented on the file, resulting in 257,799,100 records per variable.
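
As a rough sketch of what projection and calibration can look like, the example below scores a few voter-file records from a set of coefficients and then shifts the scores so that their mean matches an external target. Everything here is an assumption for illustration: the coefficient values, the covariate names, the logistic link, the single intercept shift, and the target of 45 are all hypothetical and are not taken from our actual models or from the Trump approval data.

```python
import math

# Hypothetical coefficients, standing in for a model fit on survey data.
COEFS = {"intercept": -0.4, "age_65_plus": 0.6, "college": -0.8, "rural": 0.5}


def score(record, shift=0.0):
    """Project the coefficients onto one voter-file record; return a 0-100 score."""
    z = COEFS["intercept"] + shift + sum(
        COEFS[name] * record.get(name, 0)
        for name in COEFS
        if name != "intercept"
    )
    return 100.0 / (1.0 + math.exp(-z))


def calibrate_shift(records, target_mean, lo=-10.0, hi=10.0, tol=1e-6):
    """Bisection search for the logit-scale shift whose mean score hits target_mean.

    Works because the mean score increases monotonically with the shift.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean = sum(score(r, mid) for r in records) / len(records)
        if mean < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Three made-up records for one hypothetical state.
voters = [
    {"age_65_plus": 1, "college": 0, "rural": 1},
    {"age_65_plus": 0, "college": 1, "rural": 0},
    {"age_65_plus": 0, "college": 0, "rural": 0},
]

shift = calibrate_shift(voters, target_mean=45.0)
calibrated = [score(v, shift) for v in voters]
print([round(s, 1) for s in calibrated])
```

Shifting on the logit scale keeps every score inside 0 to 100 and preserves the ordering of individuals within the state, which is one reason to prefer it over adding a constant to the scores directly.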

Psychometric Variables:

1) Anti-Elitist Populism (Theoretical Range: 0=not at all populist; 100=extremely populist): anti-elite populism aims to capture whether people believe there is a struggle between “the people” and “the elites,” or, in starker terms, a conspiracy of elites to stack the odds against ordinary citizens. We ask questions about trust in elites, the distribution of power between “people” and “elites,” and thoughts on the “system.”

2) Economic Populism (Theoretical Range: 0=not at all populist; 100=extremely populist): economic populism is distinct from anti-elite populism in that it focuses on the economic struggle of the working class. We ask about unions, big business, and the social safety net. Someone can be a populist who believes the world is stacked against her without being an economic populist who thinks the government should provide a ladder up. President Trump sold voters on being both a populist and someone who believes in economic populism, but while his rhetoric still matches populism, his policies are the antithesis of economic populism.

3) Racial Resentment (Theoretical Range: 0=not at all racially resentful; 100=extremely racially resentful): we use a standard battery of questions about racial resentment. We ask about black work ethic, historical discrimination, crime, and forms of black protest. Again, this is a topic that is particularly hard to poll directly, so it is critical to ask questions that people are comfortable answering truthfully in order to properly categorize their latent or “true” beliefs.

4) Traditionalism (Theoretical Range: 0=not at all traditionalist; 100=extremely traditionalist): traditionalism aims to capture people’s belief in conservative mores. We ask questions about corporal punishment in school and at home, generational divides, morality, and religious depth. A traditionalist believes the world was better 50 or 100 years ago.

5) Compassion (Theoretical Range: 0=not at all compassionate; 100=extremely compassionate): compassion aims to measure people’s concern for those who are less fortunate than they are. We ask questions about disability, homelessness, reformed criminals, etc.

6) Globalism (Theoretical Range: 0=not at all globalist; 100=extremely globalist): globalism is meant to capture beliefs about how open the US should be as a society. We ask about trade, automation, and fears about the global economy.

7) Authoritarianism (Theoretical Range: 0=not at all authoritarian; 100=extremely authoritarian): we capture people’s latent preference for centralized authority, i.e. systems in which decisions are made autocratically instead of by a legal body, and where dissent is not tolerated as an acceptable form of free speech. We ask about free speech, obedience, and respect for authority.

8) Trust in Institutions (Theoretical Range: 0=not at all trusting in institutions; 100=extremely trusting in institutions): we are aiming at a holistic view of people’s trust in US institutions, ranging from media to political to non-political government, with a focus on the intelligence community. It is interesting to see how this shifts with party control.

9) Presidential (Theoretical Range: 0=Trump not at all presidential; 100=Trump extremely presidential): this is a set of questions that is explicitly tied to the current president. We ask about the president’s appropriateness, honesty, work ethic, competence, morality, communication, and managerial skills.

In a previous post we showed how racial resentment would have been a good predictor of county-level movement towards the Democratic senatorial candidate in Alabama in 2017, conditional on other standard demographics. Below we plot the change in Democratic two-party vote share in Alabama between 2012 and 2017, and a suggestive story emerges: changes in vote share do correlate with our county-level racial resentment scores. In counties less plagued by racial resentment, voters switched to the Democrat, Doug Jones, at higher rates. Over the next few weeks and months we will roll out validations for the power of the other eight variables.
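
For readers who want to run the same kind of check on their own data, here is a minimal sketch of the arithmetic: compute the Democratic two-party vote share in each year, take the difference, and correlate it with the county-level score. The four counties and all of their numbers are invented purely for illustration; they are not real Alabama results or real racial resentment scores, and this simple correlation does not condition on demographics the way the analysis in our previous post does.

```python
import statistics


def two_party_share(dem_votes, rep_votes):
    """Democratic share of the two-party vote (third parties excluded)."""
    return dem_votes / (dem_votes + rep_votes)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# INVENTED numbers for illustration only; not real counties, votes, or scores.
counties = [
    {"name": "county_a", "resentment": 35.0,
     "dem_2012": 4000, "rep_2012": 6000, "dem_2017": 2600, "rep_2017": 2400},
    {"name": "county_b", "resentment": 50.0,
     "dem_2012": 3000, "rep_2012": 7000, "dem_2017": 1900, "rep_2017": 3100},
    {"name": "county_c", "resentment": 65.0,
     "dem_2012": 2500, "rep_2012": 7500, "dem_2017": 1450, "rep_2017": 3550},
    {"name": "county_d", "resentment": 75.0,
     "dem_2012": 2000, "rep_2012": 8000, "dem_2017": 1050, "rep_2017": 3950},
]

# Change in Democratic two-party vote share, 2017 minus 2012, per county.
swings = [
    two_party_share(c["dem_2017"], c["rep_2017"])
    - two_party_share(c["dem_2012"], c["rep_2012"])
    for c in counties
]
resentment = [c["resentment"] for c in counties]

r = pearson(resentment, swings)
print(f"correlation between resentment score and Democratic swing: {r:.2f}")
```

Using two-party share rather than raw vote counts keeps the comparison meaningful when turnout differs sharply between a presidential year and a special election.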


There is more validation work to be done on these psychometric variables beyond showing that they would help predict the outcome of an election, namely in targeting and messaging. First, we are exploring how these variables can help refine targets among likely marginal voters when it comes to persuadability. We have models for both persuasion and likely turnout, but do any of these variables improve those models? Second, we will explore how these variables can help generate appropriate messaging for target populations. Should advertisers tailor messages based on these latent variables?