In 2012 my team took on the challenge of making a daily poll for Xbox and tracked voter intention for the last 45 days. The Xbox audience is a little younger and more male than the voting population (just kidding, a lot younger and a lot more male!). But we worked on a new method of modeling and post-stratifying the data and did a great job in predicting the outcome of the election. This year, in collaboration with Tobias Konitzer and Sam Corbett-Davies, both graduate students at Stanford, we have taken on two different types of opt-in polling: an old-school internet display poll on a portal (MSN) and a mobile-only poll, distributed via the Pollfish app. This has presented new challenges that call for new models. I want to start by focusing today on MSN and circle back to Pollfish next week.
Since the beginning of August, MSN has been running daily polls on its election 2016 page. We ask 8 or 9 questions per day. Generally, that includes 4-5 issues of the day: some of them regularly occurring issues like support for gun control or immigration, and others germane to the day like concern over tax returns or interest in the upcoming debate. Then we ask a series of demographic questions: age, gender, party identification, and presidential approval. And, one more key question: the vote intention of the respondent.
While people take the poll, they see the raw results, but on this page we are going to highlight the analyzed results. Every poll you ever see has raw results and analyzed results. The New York Times provided raw polling results, from Florida, to four respected pollsters (including me!) and showed how they got four polished, analyzed poll results from them ranging from Trump +1 to Clinton +4. The raw poll numbers: Clinton +8. There is a lot of interpretation in the analysis, but we believe that our procedure is the best for this type of opt-in polling data.
What we do is pretty advanced, but also very intuitive: we model and post-stratify the data. We run a multilevel regression with post-stratification (MRP), following the general principles described in this recent academic paper, but with new advances added. Seriously, let me slow that down …
First, we model the raw response data of vote intention, given the following respondent characteristics: age, gender, state, and party identification. We are actually imputing party identification from a mix of the stated party identification and presidential approval. This information divides the population into hundreds of demographic categories, and we predict the percent of people in each category that would poll for Clinton, Trump, or Other, if the entire country showed up to the poll. Every one of those predictions is informed by all polling responses, not just those from each day, but those from the past as well. In more technical terms, we have added complex dynamics to the model that allow us to separate variance in sample composition over time from true swings over time. This is really important, because many of the demographic combinations do not come to the poll on any given day. It is also a major advancement over what we did last cycle.
The model starts running on the first day of the polling data and creates a series of coefficients, one for each demographic, to predict the marginal impact of that demographic on how people answer the voter intention question. Unlike previous years, we then run the model for the next day with the previous day’s coefficients as a sort of baseline. We restrict how much volatility these coefficients can have each day. The more we see of a demographic, the more we allow it to be defined by each day’s polls rather than by all of the polling. For instance, the marginal impact of being male versus female is allowed to evolve based on each day’s polling, because we see a lot of men and women. But the coefficient for North Dakota is basically derived from all of the data, because we do not see too many people from North Dakota.
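To make the idea of "yesterday's coefficients as a baseline" concrete, here is a minimal sketch in Python. It is not our actual model: it swaps in a simple precision-weighted (Kalman-style) update for a single coefficient, and the function name, the variance parameters, and all of the numbers are assumptions chosen for illustration. The point is only that a group we see a lot of follows today's data, while a group we rarely see stays close to the baseline.

```python
def update_coefficient(prev_mean, prev_var, day_estimate, day_n,
                       drift_var=0.01, obs_var_per_resp=1.0):
    """One day's update of a single coefficient (illustrative only).

    prev_mean, prev_var: yesterday's estimate and its uncertainty.
    day_estimate: the coefficient implied by today's responses alone.
    day_n: number of respondents from this group today.
    drift_var, obs_var_per_resp: assumed tuning values, not fitted.
    """
    # Yesterday's coefficient is the baseline; drift_var caps how far
    # it is allowed to move in one day.
    prior_var = prev_var + drift_var
    if day_n == 0:
        # Nobody from this group showed up today: keep the baseline.
        return prev_mean, prior_var
    # Today's estimate is noisier when few respondents are seen.
    obs_var = obs_var_per_resp / day_n
    # Weight on today's data: near 1 for large groups, near 0 for tiny ones.
    gain = prior_var / (prior_var + obs_var)
    new_mean = prev_mean + gain * (day_estimate - prev_mean)
    new_var = (1.0 - gain) * prior_var
    return new_mean, new_var


# Hypothetical inputs: same baseline and same daily estimate,
# but very different sample sizes.
male_mean, male_var = update_coefficient(0.10, 0.02, 0.30, day_n=2000)
nd_mean, nd_var = update_coefficient(0.10, 0.02, 0.30, day_n=3)
```

With these made-up inputs, the gender coefficient moves almost all the way to today's estimate, while the North Dakota coefficient barely leaves its baseline.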
We further supplement the data by using previously reported demographics to fill in the demographics of respondents who do not answer those questions at later dates. So, if a respondent answers the demographic questions at any point, we can then count their responses for the rest of the cycle.
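A minimal sketch of that carry-forward step, in Python, might look like the following. The record layout (`respondent`, `date`, `vote`, `demo`) and the function name are hypothetical, and this version only fills forward in time, which is one reading of "for the rest of the cycle".

```python
def fill_demographics(responses):
    """Carry each respondent's last known demographics forward in time.

    responses: list of dicts with keys 'respondent', 'date' (ISO string),
    'vote', and optionally 'demo' (None or missing if not answered).
    Returns a new list sorted by date with 'demo' filled where possible.
    """
    known = {}   # respondent id -> most recent demographics seen so far
    filled = []
    for r in sorted(responses, key=lambda r: r["date"]):
        demo = r.get("demo")
        if demo is not None:
            known[r["respondent"]] = demo      # remember for later days
        else:
            demo = known.get(r["respondent"])  # reuse earlier answer, if any
        filled.append({**r, "demo": demo})
    return filled


# Hypothetical responses: reader "a" gives demographics once, then skips them.
responses = [
    {"respondent": "a", "date": "2016-09-01", "vote": "Clinton",
     "demo": {"age": "45+", "gender": "F"}},
    {"respondent": "a", "date": "2016-09-02", "vote": "Clinton", "demo": None},
    {"respondent": "b", "date": "2016-09-01", "vote": "Trump", "demo": None},
]
filled = fill_demographics(responses)
# Only responses with demographics can be placed in a demographic category.
usable = [r for r in filled if r["demo"] is not None]
```

Here both of reader "a"'s responses become usable, while reader "b", who never reported demographics, is still left out.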
Second, we then project these predictions onto our best estimate of the likely voting population. This likely voter space comes from three data sources: a voter file of all voters, the Census, and the latest polling on party identification. From this data we can estimate the number of voters in each of these categories who will turn out to vote in November. Here is the national voter intention going into the final debate on October 19:
This chart is the product of 379,152 unique polls answered by 122,963 unique readers over 2.5 months. Readers are limited to answering once per day, but are otherwise able to come back. Actually, we are excited that 8,000 respondents have answered 10 or more times over the 2.5 months, providing a unique panel that allows us to examine within-person shifts in voter intention.
The readers do not resemble the voting population; if we just reported the raw data, we would be in trouble. 75 percent of respondents are over 45 years old versus about 55 percent of the voting population. And 57 percent are male versus about 46 percent of the voting population.
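To see why this matters, here is a toy post-stratification in Python using only the age split above. The sample and population shares (75 percent versus about 55 percent over 45) come from the numbers in this post; the support rates in each age group are made up purely for illustration and are not our estimates. In the real procedure the per-category support comes from the model, and there are hundreds of categories rather than two.

```python
def weighted_share(cell_support, cell_weight):
    """Average support across cells, weighting each cell by cell_weight."""
    total = sum(cell_weight.values())
    return sum(cell_support[c] * cell_weight[c] for c in cell_support) / total


# Share of each age group among poll respondents and among likely voters
# (from the figures quoted above).
sample_share = {"over_45": 0.75, "45_and_under": 0.25}
population_share = {"over_45": 0.55, "45_and_under": 0.45}

# HYPOTHETICAL support for one candidate within each group.
support = {"over_45": 0.44, "45_and_under": 0.58}

# Raw result: cells weighted by who happened to answer the poll.
raw = weighted_share(support, sample_share)
# Post-stratified result: the same cells weighted by the voting population.
post = weighted_share(support, population_share)
```

With these invented support rates, the raw number comes out to 47.5 percent and the post-stratified number to 50.3 percent: the same responses, but a different answer once the older-skewing sample is reweighted to look like the electorate.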
Not only do we project the national popular vote, but also the state-by-state vote and detailed demographics. Below are the detailed demographics from the day before the final debate. While these results are experimental, we are confident that they are a meaningful representation of vote intentions at that time. We see big jumps around the first debate and the release of the Access Hollywood tape. In fact, our movement closely mirrors the polling average of traditional polls on Huffington Post’s Pollster; we just have a lot less volatility and cost a fraction of the money to run daily!