A number of naysayers [Daily Kos, The Register, The Big Picture, Reason] are discrediting prediction markets, latching onto the fact that markets like TradeSports and NewsFutures failed to call this year’s Democratic takeover of the US Senate. Their critiques reflect a clear misunderstanding of the nature of probabilistic predictions, as many others [Emile, Lance] have pointed out. Their misunderstanding is perhaps not so surprising. **Evaluating probabilistic predictions is a subtle and complex endeavor, and in fact there is no absolute right way to do it.** This fact may pose a barrier for the average person to understand and trust (probabilistic) prediction market forecasts.

In an excellent article in The New Republic Online [full text], Bo Cowgill and Cass Sunstein describe in clear and straightforward language the fallacy that many people seem to have committed: interpreting a probabilistic prediction like “Democrats have a 25% chance of winning the Senate” as the categorical prediction “The Democrats will not win the Senate”. Cowgill and Sunstein explain the right way to interpret probabilistic predictions:

If you look at the set of outcomes estimated to be 80 percent likely, about 80 percent of them [should happen]; events estimated to be 70 percent likely [should] happen about 70 percent of the time; and so on. This is what it means to say that prediction markets supply accurate probabilities.

**Technically, what Cowgill and Sunstein describe is called the calibration test.** The truth is that the calibration test is a necessary test of prediction accuracy, but not a sufficient test. In other words, **for a predictor to be considered good it must pass the calibration test, but at the same time some very poor or useless predictors may also pass the calibration test.** Often a stronger test is needed to truly evaluate the accuracy of probabilistic predictions.

For example, suppose that a meteorologist predicts the probability of rain every day. Now suppose this meteorologist is lazy and he predicts the same probability every day: he simply predicts the annual average frequency of rain in his location. He doesn’t ever look at cloud cover, temperature, satellite imagery, computer models, or even whether it rained the day before. Clearly, this meteorologist’s predictions would be uninformative and nearly useless. However, over the course of a year, this meteorologist would perform very well according to the calibration test. Assume it rains on average 10% of the time in the meteorologist’s city, so he predicts “10% chance” every day. If we test his calibration, we find that, among all the days he predicted a 10% chance of rain (i.e., every day), it actually rained about 10% of the time. This lazy meteorologist would get a nearly perfect score according to the calibration test. A hypothetical competing meteorologist who actually works hard to consider all variables and evidence, and who thus predicts different percentages on different days, could do no better in terms of calibration.
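To make the lazy meteorologist concrete, here is a minimal Python sketch of the calibration test. The rainfall record, the second ("diligent") forecaster, and the function name `calibration_table` are all made up for illustration; the only point is that both forecasters come out perfectly calibrated even though only one of them tells you anything about a given day.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group events by forecast probability; return {forecast: observed frequency}."""
    buckets = defaultdict(list)
    for p, happened in zip(forecasts, outcomes):
        buckets[p].append(happened)
    return {p: sum(hits) / len(hits) for p, hits in buckets.items()}

# Hypothetical record: 20 days, rain (1) on the first and last day, i.e. 10% of days.
rained = [1] + [0] * 18 + [1]

# Lazy meteorologist: the long-run base rate, every single day.
lazy = [0.1] * len(rained)

# Hypothetical diligent meteorologist: 50% on four unsettled days
# (two of which turn out rainy), 0% on the sixteen clear days.
diligent = [0.5, 0.5] + [0.0] * 16 + [0.5, 0.5]

print(calibration_table(lazy, rained))      # one bucket: forecast 0.1, observed 0.1
print(calibration_table(diligent, rained))  # two buckets, each matching its forecast
```

Both tables match forecast to observed frequency exactly, so calibration alone cannot separate the two forecasters.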

The above example suggests that **good predictions are not just well calibrated: good predictions are, in some sense, both variable AND well calibrated.** So what is the “right” way to evaluate probabilistic predictions? There is no single absolute best way, though several tests are appropriate, and probably can be considered stronger tests than the calibration test. In our paper “Does Money Matter?” we use four evaluation metrics:

1. **Absolute error:** The average over many events of lose_PR, the probability assigned to the losing outcome(s)
2. **Mean squared error:** The square root of the average of (lose_PR)^{2}
3. **Quadratic score:** The average of 100 – 400*(lose_PR)^{2}
4. **Logarithmic score:** The average of log(win_PR), where win_PR is the probability assigned to the winning outcome
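As a rough sketch, the four metrics above can be computed as follows. This is not the code used in the paper: the function name `evaluate` and the sample numbers are invented, the natural logarithm is an assumption (the base is not specified above), and lose_PR is taken to be 1 – win_PR, which assumes each predictor's probabilities over an event's outcomes sum to one.

```python
import math

def evaluate(win_probs):
    """Score a predictor from win_PR values, one per event.

    win_probs: probability the predictor assigned to the outcome that
    actually occurred. Assumes lose_PR = 1 - win_PR for every event.
    """
    n = len(win_probs)
    lose = [1.0 - p for p in win_probs]
    mean_sq = sum(q * q for q in lose) / n
    return {
        # Lower is better for the two error metrics.
        "absolute_error": sum(lose) / n,
        # Named as in the list above; note the square root.
        "mean_squared_error": math.sqrt(mean_sq),
        # Higher is better for the two scores.
        "quadratic_score": 100 - 400 * mean_sq,
        # Natural log is an assumption; any base gives the same ranking.
        "log_score": sum(math.log(p) for p in win_probs) / n,
    }

# Made-up win_PR values for two predictors on the same four events.
market = [0.8, 0.6, 0.9, 0.3]
expert = [0.7, 0.5, 0.6, 0.4]

for name, probs in [("market", market), ("expert", expert)]:
    print(name, evaluate(probs))
```

The numbers only mean something side by side: compare the market's row against the expert's row, metric by metric.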

Note that the absolute value of these metrics is not very meaningful. The metrics are useful only when comparing one predictor against another (e.g., a market against an expert).

My personal favorite (advocated in papers and presentations) is the logarithmic score. The logarithmic score is one of a family of so-called *proper scoring rules* designed so that an expert maximizes her expected score by truthfully reporting her probability judgment (the quadratic score is also a proper scoring rule). Stated another way, experts with more accurate probability judgments should be expected to accumulate higher scores on average. The logarithmic score is closely related to entropy: the negative of the logarithmic score gives the amount (in bits of information, when the logarithm is taken base 2) by which the expert is “surprised” by the actual outcome. Increases in logarithmic score can literally be interpreted as measuring information flow.
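Here is a small Python sketch of both properties for a single binary event. The function names and the 0.01 grid of candidate reports are my own choices for illustration: the first part searches for the report that maximizes the expected logarithmic score given a true belief, and the second converts the probability assigned to the winning outcome into bits of surprise using a base-2 logarithm.

```python
import math

def expected_log_score(belief, report):
    """Expected log score of `report` if the event truly occurs with probability `belief`."""
    return belief * math.log(report) + (1 - belief) * math.log(1 - report)

def best_report(belief, grid_size=100):
    """Search reports 0.01, 0.02, ..., 0.99 for the one with the highest expected score."""
    candidates = [i / grid_size for i in range(1, grid_size)]
    return max(candidates, key=lambda r: expected_log_score(belief, r))

def surprise_bits(win_pr):
    """Negative base-2 log of the probability assigned to the outcome that occurred."""
    return -math.log2(win_pr)

print(best_report(0.7))     # 0.7: the truthful report scores best
print(surprise_bits(0.25))  # 2.0 bits when a 25% outcome occurs
print(surprise_bits(0.5))   # 1.0 bit when a coin-flip outcome occurs
```

Shading the report away from 0.7 in either direction lowers the expected score, which is what "proper" means here.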

Actually, the task of evaluating probabilistic predictions is even trickier than I’ve described. Above, I said that a good predictor must at the very least pass the calibration test. Strictly speaking, that’s only true when the predicted events are statistically independent. It is possible for a perfectly valid predictor to appear miscalibrated when the events he or she is predicting are highly correlated, as discussed in a previous post.
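A deliberately extreme toy example in Python shows how this can happen. Everything here is hypothetical: ten races that are perfectly correlated (a single national swing decides all of them), two equally likely swings, and a predictor who correctly assigns 50% to each race.

```python
def observed_frequency(outcomes):
    """Fraction of events that happened."""
    return sum(outcomes) / len(outcomes)

n_races = 10
forecast = 0.5  # correct marginal probability for every race

# Two equally likely worlds; within each, all races go the same way.
worlds = {
    "swing_one_way": [1] * n_races,
    "swing_other_way": [0] * n_races,
}

per_world = {name: observed_frequency(o) for name, o in worlds.items()}
long_run = sum(per_world.values()) / len(per_world)

print(per_world)  # 1.0 in one world, 0.0 in the other: never near 0.5
print(long_run)   # 0.5 averaged over the two worlds, matching the forecast
```

Judged on one election, this predictor looks badly miscalibrated in either world, even though every 50% forecast was exactly right.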


Aren’t metrics 2 and 3 mathematically equivalent? That is, can you find two sets of predictions x_1 .. x_n vs. y_1 … y_n where #2 prefers x_i and #3 prefers y_i?

Yes, you’re right: 2 & 3 will always rank predictors equivalently. (We used 3 because it’s the one used by the ProbabilitySports contest that we were comparing against.)

[...] #3. “implying infallibility of PM’s” – Nobody in the industry is claiming that. These naysayers are the leftists who had embraced prediction markets before the 2006 US elections and who are now turning against them. [...]

[...] #12. State that evaluating probabilistic predictions is very tricky —in case readers missed it the first time. [...]


This is a common pitfall in the use of statistical data. Most people do not understand this. As a result, many professionals, including advertisers and politicians, use this loophole to twist facts.