Predicting the World Cup is not that much different from predicting other sports outcomes or even economic indicators or awards shows. First, we determine what the stakeholders want to know. For the World Cup, we determine that it is the likelihood of win, loss, or draw for either team in any game and the likelihood of any team advancing to any round (including winning the tournament); for reasons of tie-breakers and expediency we also consider goal differential. Second, as always, we ensure that these forecasts update as the games progress. Finally, we always consider the same set of data to ensure accuracy.

In the course of our regular forecasting we always review four different data types: fundamental data, online and social media, prediction markets, and polls of experts. Online and social media data are not significant for the World Cup at this point. This type of data clearly provides value in understanding the support and interest of people from around the world, but without historical context it is impossible to tell whether it has any predictive power relative to more traditional data. And, while polls of experts can be useful in predicting sports, we are going to keep things simple and transparent for this World Cup and focus on fundamental data and prediction markets.

I am going to walk through the fundamental data at some length before describing the prediction market data quickly. That is because the fundamental data is much more interesting and the prediction market data is the same as it always is, in all domains. But prediction market data is a lot more predictive than the fundamental data and, despite my fun in running the fundamental data, it forms the basis of all of the forecasts we are going to generate.

Using fundamental data to predict how teams will do across a season, or in an upcoming game, is a relatively stable task across major sports. The key fundamental variables are always the same: scoring differential, home and away, and wins/losses in the past season. Of course, different sports count scores in different ways (e.g., American football has scores that range from 1 point to 6 points) and count wins/losses differently as well (e.g., soccer has outcomes that range from 0 to 3 points). Generally, home and away (and strength of schedule) are balanced, but that too is not always the case (e.g., baseball loads the schedule heavily with teams in the same division). Home field is a huge advantage in soccer (e.g., in a fun example, this article notes that injury time in Spain heavily favors the home team). That being said, give me the scoring differentials of each team from the previous year, their schedule including home and away, and their final outcome in wins/losses and I can predict both season and game-by-game outcomes with precision.

We can improve upon this baseline prediction in two ways: account for shifts in personnel and factor out luck. All of the major sports now have models of wins or points above replacement, an idea that came out of baseball’s sabermetric community. This metric describes how valuable a certain player is compared with a baseline player in his/her position. There is still some debate on this metric and it varies a lot by sport, but a reasonable version of it will allow a researcher to get pretty close to quantifying the impact of substituting one player for another. Further research has examined the role of luck in the wins/points of any team in a given year, to separate what was in the control of the players from what was either lucky or unlucky.

Soccer is a standard case of what I just described: goal differential, home/away, and the previous year’s points will get you pretty far in predicting future outcomes. Add in wins over replacement for changes in the team, factor out luck, and you can be as good as anything.

Playoff predictions are just a compilation of game-by-game predictions, using the current regular season’s data. There are two small quirks to consider: effort and playoff design. First, in certain sports there are definable times when teams are not at full strength or maximum effort, such as late in the season for teams with nothing to play for; in those situations we need to account for this differential effort. Second, compiling the likelihood of a team advancing in any given round depends on the design of the playoffs. Single eliminations are straightforward applications of the likelihood-to-win-a-game formula between two teams, but best-of-seven series and round robins have their quirks (e.g., NBA teams are more likely to win game 2 if they lost game 1 than if they won game 1).

In short, major team sports from around the world are all pretty similar in predicting regular season and playoff success, but the World Cup has one crazy quirk: it has no regular season. There are directly comparable variables, but they are noisier (i.e., much less precise). Countries compete in three types of matches with other countries on a semi-regular basis between World Cups: friendly matches, regional tournaments, and World Cup qualification tournaments. All of these games combined are a fraction of what a regular season is in most leagues.

These games provide similar data to what we normally have: there is a goal differential, there is home/away, and, in lieu of past season wins/points, we have world rankings (compiled by FIFA based on a team’s performance in the last four years) and Elo rankings (which are based on head-to-head matches). Unlike a regular season, where the choice of opponents and location is balanced (or the choice set is transparent), the schedule of any team is endogenously chosen by the countries to maximize the return for their team, and more wins in a tournament means more games against better teams. Also, there are major personnel changes over any four-year period, especially with players going in and out for friendlies and lesser tournaments.

Specifically we start with the following:

1) Average goal differential broken up by home/away/neutral, and friendly/tournament/World Cup qualifier. The friendly/tournament/World Cup qualifier split lets us examine the predictive power of games that are likely to have lower effort and more variable personnel.

2) World ranking and Elo score act as the equivalent of points/wins from previous years, and the Elo score absorbs the strength of the schedule a team has played.

We take this data for past World Cup cycles and regress the outcomes of all of the World Cup games on it to get coefficients for the various variables. We can then plug in the 2014 data to get baseline forecasts for any given game going into the World Cup, both the goal differential and the likelihood of a win, loss, or draw.
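To make the fit-then-plug-in step concrete, here is a minimal sketch with a single predictor. It is a simplification, not our actual model, which uses all of the variables listed above. The rows in `PAST_GAMES` are made-up placeholders rather than real match data, and the names `fit_line` and `predict` are hypothetical.

```python
# Toy illustration only: the rows below are made-up placeholders, not real match data.
# x = difference in average pre-tournament goal differential (team minus opponent)
# y = goal differential in the World Cup game itself
PAST_GAMES = [(-1.0, -1.0), (0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]


def fit_line(rows):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, x):
    """Plug a new pre-tournament value into the fitted line."""
    return intercept + slope * x


# Fit on past cycles, then plug in a (hypothetical) value for an upcoming game.
slope, intercept = fit_line(PAST_GAMES)
forecast = predict(slope, intercept, 1.5)
```

The same two steps apply with more variables; only the fitting routine changes.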

The differences in goal differential swamp the rankings in predicting both goal differential and probability of victory in any game. This is not surprising, as these rankings are just reflections of win/loss/draw (slightly adjusted for strength of opponent), which is trumped by goal differential. Further, the away games are slightly more predictive than home games, which is not surprising, as there is just one home team in the World Cup.

Yet, these predictions for the World Cup games are a lot less precise than the predictions for a regular season or playoff soccer game. With all of the idiosyncratic variables of a World Cup, where teams with no regular season play at neutral sites, the fundamental data is going to provide forecasts of scores with larger margins of error and probabilities of victory that tend more towards toss-up than we would normally produce.

That is where prediction market data comes into play; it does its best when there is idiosyncratic data to incorporate. Prediction markets buy and sell contracts that are, canonically, worth $1 if true and $0 if not. Thus, the price of a contract for Brazil to win the World Cup or any particular game is highly predictive of the probability of the outcome occurring. Massive amounts of historical data help us translate raw prediction market prices into very precise probabilities of outcomes; this is especially true in the World Cup, where the prediction markets have very robust action on all games.
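As a rough illustration of the first step in reading prices as probabilities, the sketch below rescales the prices of mutually exclusive contracts on one game so they sum to 1. This simple normalization is an assumption made for the example; it is not the historical-data calibration described above, and the prices and the function name `prices_to_probabilities` are hypothetical.

```python
def prices_to_probabilities(prices):
    """Rescale contract prices (in dollars, $1 payout) so they sum to 1.

    Assumes the contracts cover mutually exclusive, exhaustive outcomes.
    """
    total = sum(prices.values())
    if total <= 0:
        raise ValueError("prices must sum to a positive number")
    return {outcome: price / total for outcome, price in prices.items()}


# Hypothetical prices for one game; they sum to 1.04 because of market frictions.
example_prices = {"win": 0.52, "draw": 0.27, "loss": 0.25}
example_probs = prices_to_probabilities(example_prices)
```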

Armed with fundamental data and prediction market-based forecasts for every game, we jump into the actual World Cup action. The tournament setting for the first round is a round robin, with the four teams in each group playing three games each for a total of six games per group. After that there is a standard 16-team single-elimination tournament where the winner of one group plays the runner-up of its paired group (e.g., the winner of group A plays the runner-up of group B and the winner of B plays the runner-up of A).

The easiest way to think about the round robin is that there are 729 possible outcomes in a six game round robin (3 outcomes over 6 games is 3^6). Assuming independence between games (that the outcome of one game does not affect the outcome of another) we can easily determine the likelihood of any of the 729 possible outcomes from the likelihood of any of the three outcomes of the six games.
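The enumeration can be sketched in a few lines. The game probabilities below are illustrative placeholders, not real forecasts, and the function names are hypothetical. As a further simplifying assumption, teams tied on points split the advancement slots evenly, whereas the real tie-breakers start with goal differential.

```python
from itertools import product

# Hypothetical group: each game maps to
# (P(first team wins), P(draw), P(second team wins)). Placeholder numbers only.
GAMES = {
    ("A", "B"): (0.50, 0.30, 0.20),
    ("C", "D"): (0.40, 0.30, 0.30),
    ("A", "C"): (0.55, 0.25, 0.20),
    ("B", "D"): (0.35, 0.30, 0.35),
    ("A", "D"): (0.60, 0.25, 0.15),
    ("B", "C"): (0.30, 0.30, 0.40),
}
# Points awarded to (first, second) for a first-team win, a draw, a second-team win.
POINTS = ((3, 0), (1, 1), (0, 3))


def group_outcomes(games):
    """Yield (points_by_team, probability) for each of the 3^n joint outcomes."""
    pairs = list(games)
    for outcome in product(range(3), repeat=len(pairs)):
        prob = 1.0
        points = {}
        for (first, second), result in zip(pairs, outcome):
            prob *= games[(first, second)][result]  # independence assumption
            points[first] = points.get(first, 0) + POINTS[result][0]
            points[second] = points.get(second, 0) + POINTS[result][1]
        yield points, prob


def advance_probabilities(games, slots=2):
    """Probability each team finishes in the top `slots`, splitting ties evenly."""
    advance = {}
    for points, prob in group_outcomes(games):
        for team, pts in points.items():
            above = sum(1 for p in points.values() if p > pts)
            tied = sum(1 for p in points.values() if p == pts)  # includes the team
            open_slots = max(0, slots - above)
            share = min(1.0, open_slots / tied)
            advance[team] = advance.get(team, 0.0) + prob * share
    return advance


outcomes = list(group_outcomes(GAMES))
advance = advance_probabilities(GAMES)
```

Because two teams advance from every outcome, the four advancement probabilities always sum to 2, which is a handy sanity check.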

At that point we have the second round set, with certain probabilities, and can determine the likely winners between potential second round teams, and so forth. This provides both the likely outcome in any game and the likelihood of any team reaching any given round.
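Here is a minimal sketch of how those round-by-round likelihoods compound in a single-elimination bracket. To stay short it assumes a fixed bracket, whereas in practice each slot is itself uncertain until the groups finish. The ratings are hypothetical, and the Elo-style win formula is a stand-in for whatever game-level forecast is available. Knockout games have no draws, so each pair's win probabilities sum to 1.

```python
def knockout_round_probabilities(bracket, win_prob):
    """Return {team: [P(wins round 1), P(wins round 2), ...]}.

    bracket: teams in bracket order; length must be a power of two.
    win_prob(a, b): probability that a beats b in a single knockout game.
    """
    n = len(bracket)
    reach = {team: 1.0 for team in bracket}  # everyone starts in round 1
    history = {team: [] for team in bracket}
    block = 1  # size of the sub-bracket each team has already emerged from
    while block < n:
        new_reach = {}
        for i, team in enumerate(bracket):
            start = (i // (2 * block)) * 2 * block
            # Possible opponents sit in the other half of this 2*block section.
            if i < start + block:
                opponents = bracket[start + block:start + 2 * block]
            else:
                opponents = bracket[start:start + block]
            p_win = sum(reach[o] * win_prob(team, o) for o in opponents)
            new_reach[team] = reach[team] * p_win
        reach = new_reach
        for team in bracket:
            history[team].append(reach[team])
        block *= 2
    return history


# Hypothetical ratings, used only to produce example win probabilities.
RATINGS = {"W": 1900, "X": 1800, "Y": 1750, "Z": 1700}


def elo_win_prob(a, b):
    """Standard Elo expected-score formula, used here as a placeholder."""
    return 1.0 / (1.0 + 10 ** ((RATINGS[b] - RATINGS[a]) / 400))


rounds = knockout_round_probabilities(["W", "X", "Y", "Z"], elo_win_prob)
```

In this four-team example the first entry for each team is its chance of winning the semifinal and the last is its chance of winning the title.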

Of course, independence is not necessarily the correct choice for the World Cup; early games in the round robin affect later games in the round robin. I already noted that in the NBA some teams are more likely to win after a loss (due to either increased effort or referees’ calls). The opposite effect would be that we may learn that a team is better than we thought ex-ante because it won an earlier game. In the NBA they play 82 regular season games, so we do not learn much if they happen to win a game in the playoffs, but in the World Cup they play 0 regular season games, so we learn a lot when they win a game. Thus, the consensus in our data is that we should slightly update teams after they win in the group stage. This is not significant in the later rounds, where all teams are winners, but it is in the round robin.
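One common way to express a small post-game update is the standard Elo adjustment, sketched below. This is an illustrative stand-in rather than the update we actually apply; the starting ratings and the step size `k=20` are assumptions chosen for the example.

```python
def expected_score(rating_a, rating_b):
    """Elo expected score for team A against team B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(rating_a, rating_b, score_a, k=20):
    """Shift both ratings after a game.

    score_a is 1 for an A win, 0.5 for a draw, 0 for an A loss.
    k controls how much a single game moves the ratings (assumed value).
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated (hypothetical) teams; A wins its opening group game.
new_a, new_b = update_ratings(1800, 1800, score_a=1)
```

A larger `k` would reflect the point above: with no regular season, each World Cup result carries more information than a single game in a long season.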

Prediction markets shine when there is a lot of idiosyncratic data making fundamental predictions imprecise. That is when we need the wisdom of the crowd to quantify the likely outcome. Thus, while we work through both the fundamental data and prediction market-based forecasts, we put the weight of our prediction on the prediction market data.

Check out all of our World Cup coverage at: