explainer, prediction markets, probability

Evaluating probabilistic predictions

December 26, 2006 David Pennock 7 Comments

A number of naysayers [Daily Kos, The Register, The Big Picture, Reason] are discrediting prediction markets, latching onto the fact that markets like TradeSports and NewsFutures failed to call this year’s Democratic takeover of the US Senate. Their critiques reflect a clear misunderstanding of the nature of probabilistic predictions, as many others [Emile, Lance] have pointed out. Their misunderstanding is perhaps not so surprising. Evaluating probabilistic predictions is a subtle and complex endeavor, and in fact there is no absolute right way to do it. This fact may pose a barrier for the average person to understand and trust (probabilistic) prediction market forecasts.

In an excellent article in The New Republic Online [full text], Bo Cowgill and Cass Sunstein describe in clear and straightforward language the fallacy that many people seem to have committed: interpreting a probabilistic prediction like “Democrats have a 25% chance of winning the Senate” as a categorical prediction “The Democrats will not win the Senate”. Cowgill and Sunstein explain the right way to interpret probabilistic predictions:

If you look at the set of outcomes estimated to be 80 percent likely, about 80 percent of them [should happen]; events estimated to be 70 percent likely [should] happen about 70 percent of the time; and so on. This is what it means to say that prediction markets supply accurate probabilities.

Technically, what Cowgill and Sunstein describe is called the calibration test. The truth is that the calibration test is a necessary test of prediction accuracy, but not a sufficient test. In other words, for a predictor to be considered good it must pass the calibration test, but at the same time some very poor or useless predictors may also pass the calibration test. Often a stronger test is needed to truly evaluate the accuracy of probabilistic predictions.
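
As a rough illustration of the calibration test, here is a minimal Python sketch. The function name `calibration_table`, the choice of ten equal-width bins, and the sample data are all assumptions made for this example; they do not come from the Cowgill and Sunstein article.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and compare the
    mean forecast in each bin with the observed frequency of the event."""
    bins = defaultdict(list)
    for p, happened in zip(forecasts, outcomes):
        # Clamp the index so that a forecast of exactly 1.0 lands in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, happened))
    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(h for _, h in pairs) / len(pairs)
        table.append((mean_forecast, observed, len(pairs)))
    return table

# Hypothetical data: ten events forecast at 80%, eight of which happened,
# and ten events forecast at 30%, three of which happened.
forecasts = [0.8] * 10 + [0.3] * 10
outcomes = [1] * 8 + [0] * 2 + [1] * 3 + [0] * 7

for mean_forecast, observed, count in calibration_table(forecasts, outcomes):
    print(f"forecast {mean_forecast:.2f}  observed {observed:.2f}  n={count}")
```

A predictor passes the test to the extent that the two columns agree in every bin. With only a handful of events per bin, some disagreement is expected from chance alone.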

For example, suppose that a meteorologist predicts the probability of rain every day. Now suppose this meteorologist is lazy and he predicts the same probability every day: he simply predicts the annual average frequency of rain in his location. He doesn’t ever look at cloud cover, temperature, satellite imagery, computer models, or even whether it rained the day before. Clearly, this meteorologist’s predictions would be uninformative and nearly useless. However, over the course of a year, this meteorologist would perform very well according to the calibration test. Assume it rains on average 10% of the time in the meteorologist’s city, so he predicts “10% chance” every day. If we test his calibration, we find that, among all the days he predicted a 10% chance of rain (i.e., every day), it actually rained about 10% of the time. This lazy meteorologist would get a nearly perfect score according to the calibration test. A hypothetical competing meteorologist who actually works hard to consider all variables and evidence, and who thus predicts different percentages on different days, could do no better in terms of calibration.
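
The lazy meteorologist can be made concrete with a small Python sketch. Everything here is hypothetical: the 100-day record, the diligent forecaster's numbers, and the helper names are invented for illustration. The squared-error comparison at the end borrows one of the stronger measures discussed later in this post.

```python
from collections import defaultdict

def calibration_gaps(forecasts, outcomes):
    """Map each distinct forecast value to (observed frequency - forecast).
    A gap of zero for every value means perfect calibration."""
    groups = defaultdict(list)
    for p, rained in zip(forecasts, outcomes):
        groups[p].append(rained)
    return {p: sum(v) / len(v) - p for p, v in groups.items()}

def mean_squared_error(forecasts, outcomes):
    """Average squared distance between the forecast and the outcome (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical 100-day record: rain on the first 10 days only.
outcomes = [1] * 10 + [0] * 90

# Lazy meteorologist: the 10% base rate every single day.
lazy = [0.1] * 100

# Hypothetical diligent meteorologist: 50% on 20 days (half of which turn
# out rainy) and 0% on the remaining 80 dry days.
diligent = [0.5] * 20 + [0.0] * 80

print("lazy gaps:    ", calibration_gaps(lazy, outcomes))
print("diligent gaps:", calibration_gaps(diligent, outcomes))
print("lazy squared error:    ", mean_squared_error(lazy, outcomes))
print("diligent squared error:", mean_squared_error(diligent, outcomes))
```

Both forecasters show calibration gaps of zero, so the calibration test cannot separate them, while the squared error is lower for the forecaster whose predictions vary with the evidence.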

The above example suggests that good predictions are not just well calibrated: good predictions are, in some sense, both variable AND well calibrated. So what is the “right” way to evaluate probabilistic predictions? There is no single absolute best way, though several tests are appropriate, and probably can be considered stronger tests than the calibration test. In our paper “Does Money Matter?” we use four evaluation metrics:

Absolute error: The average over many events of lose_PR, the probability assigned to the losing outcome(s)
Root mean squared error: The square root of the average of (lose_PR)²
Quadratic score: The average of 100 – 400*(lose_PR)²
Logarithmic score: The average of log(win_PR), where win_PR is the probability assigned to the winning outcome

Note that the raw magnitude of any one of these metrics is not very meaningful on its own. The metrics are useful only when comparing one predictor against another (e.g., a market against an expert).
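
The four metrics listed above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the paper: the function names, the natural logarithm, the restriction to binary events (so that win_PR = 1 – lose_PR), and the sample numbers are all assumptions.

```python
import math

def absolute_error(lose_prs):
    """Average probability assigned to the losing outcome."""
    return sum(lose_prs) / len(lose_prs)

def rms_error(lose_prs):
    """Square root of the average squared lose_PR, as defined in the list above."""
    return math.sqrt(sum(p ** 2 for p in lose_prs) / len(lose_prs))

def quadratic_score(lose_prs):
    """Average of 100 - 400 * lose_PR^2 (higher is better)."""
    return sum(100 - 400 * p ** 2 for p in lose_prs) / len(lose_prs)

def logarithmic_score(win_prs):
    """Average of log(win_PR), natural log assumed (higher is better)."""
    return sum(math.log(p) for p in win_prs) / len(win_prs)

# Hypothetical lose_PR values for two predictors on the same four binary events.
market = [0.2, 0.4, 0.1, 0.7]
expert = [0.3, 0.5, 0.2, 0.6]

for name, lose_prs in (("market", market), ("expert", expert)):
    win_prs = [1 - p for p in lose_prs]  # binary events only
    print(name,
          round(absolute_error(lose_prs), 3),
          round(rms_error(lose_prs), 3),
          round(quadratic_score(lose_prs), 1),
          round(logarithmic_score(win_prs), 3))
```

Lower is better for the two error measures and higher is better for the two scores. A forecast of 50% on a binary event earns a quadratic score of exactly zero.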

My personal favorite (advocated in papers and presentations) is the logarithmic score. The logarithmic score is one of a family of so-called proper scoring rules designed so that an expert maximizes her expected score by truthfully reporting her probability judgment (the quadratic score is also a proper scoring rule). Stated another way, experts with more accurate probability judgments should be expected to accumulate higher scores on average. The logarithmic score is closely related to entropy: the negative of the logarithmic score gives the amount (in bits of information, when the logarithm is taken base 2) that the expert is “surprised” by the actual outcome. Increases in logarithmic score can literally be interpreted as measuring information flow.
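
To see the “proper” property at work, here is a small numerical sketch. It assumes a single binary event, a base-2 logarithm, and an expert whose true belief is 70%; the function names and the grid of candidate reports are made up for the example.

```python
import math

def expected_log_score(belief, report):
    """Expected base-2 log score for a binary event when the expert believes
    the event has probability `belief` but reports `report`."""
    return belief * math.log2(report) + (1 - belief) * math.log2(1 - report)

def surprise_bits(win_pr):
    """Negative log score of a single outcome, in bits."""
    return -math.log2(win_pr)

belief = 0.7
candidate_reports = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
best_report = max(candidate_reports, key=lambda q: expected_log_score(belief, q))
print("report that maximizes expected score:", best_report)

# A forecaster who gave the winning outcome a 25% chance is surprised by 2 bits.
print("surprise at a 25% event:", surprise_bits(0.25), "bits")
```

The grid search lands on the expert's true belief: shading the report in either direction lowers the expected score. Under the base-2 assumption, an outcome that was given a 25% chance carries two bits of surprise.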

In fact, the task of evaluating probabilistic predictions is even trickier than I’ve described. Above, I said that a good predictor must at the very least pass the calibration test. Strictly speaking, that’s only true when the predicted events are statistically independent. It is possible for a perfectly valid predictor to appear miscalibrated when the events he or she is predicting are highly correlated, as discussed in a previous post.
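
The effect of correlation can be sketched with an extreme, purely hypothetical case: a batch of events that are perfectly correlated, so that one underlying draw decides all of them. The function name, the 50% probability, and the batch sizes are assumptions chosen for illustration.

```python
import random

def batch_frequency(n_events, p, rng):
    """Observed frequency among n_events perfectly correlated events:
    a single draw with probability p decides every event in the batch."""
    happened = rng.random() < p
    return sum([happened] * n_events) / n_events

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# The predictor says 50% for each of 100 events, which is the correct
# marginal probability of every one of them.
one_batch = batch_frequency(100, 0.5, rng)
print("observed frequency in one batch:", one_batch)

# Across many independent batches, the forecast looks calibrated again.
many = [batch_frequency(100, 0.5, rng) for _ in range(10000)]
print("average over 10000 batches:", sum(many) / len(many))
```

Within a single batch the observed frequency is either 0 or 1, so a correct 50% forecast looks badly miscalibrated. Only across many independent batches does the frequency approach 50%.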

events, prediction markets, yahoo

confab.yahoo: Thanks everyone!

December 16, 2006 David Pennock 8 Comments

Thanks to all two hundred and seventy (!) of you who attended the confab.yahoo last Wednesday, as far as I know a record audience for an event devoted to prediction markets. [View pictures]

Thanks for spending your evening with us. Thanks for waiting patiently for the pizza and books! Thanks to the speakers (Robin, Eric, Bo, Leslie, myself, Todd, Chris, and Adam) who, after all, make or break any conference: in this case IMO definitely “make”. The speakers delivered wit and wisdom, and did it within their allotted times! It’s nice to see Google, HP, Microsoft, and Yahoo! together in one room discussing a new technology and (go figure) actually agreeing with one another for the most part. Thanks to James Surowiecki for his rousing opening remarks and for doing a fabulous job moderating the event. Thanks to the software demo providers Collective Intellect, HedgeStreet, HSX, and NewsFutures: next time we’d like to give that venue more of the attention it deserved. Thanks to Yahoo! TechDev and Yahoo! PR for planning, marketing, and executing the event. A special thanks to Chris Plasser, who orchestrated every detail from start to finish flawlessly while juggling his day job, making it all look easy in the process.

Many media outlets and bloggers attended. Nice articles appear in ZDNet and CNET, the latter of which was slashdotted yesterday. The local ABC 11 o’clock news even featured a piece on the event [see item #35 in this report]. I’m collecting additional items under MyWeb tag ‘confab.yahoo’.

CNET and Chris Masse (on Midas Oracle) provide excellent summaries of the technical content of the event. So I’ll skip any substantive comments (for now) and instead mention a few fun moments:

Bo began by staring straight into the camera and giving a shoutout to Chris Masse, the eccentric Frenchman who also happens to be a sharp, tireless, and invaluable (and don’t forget bombastic) chronicler of the prediction markets field via his portal and blog.
Todd had the audience laughing with his story of how a prediction market laid bare the uncomfortable truth about an inevitable product delay, to the incredulity of the product’s manager. (Todd assured us that this was a Microsoft internal product, not a consumer-facing product.)
I had the unlucky distinction of being the only speaker to suffer from technical difficulties in trying to present from my own Mac Powerbook instead of the provided Windows laptop. Todd later admitted that he was tempted to make a Windows/Mac quip like “Windows just works”.
Adam finished with a Jobsian “one more thing” announcement of their latest effort, worthio, a secret project they’ve been hacking away at nights and weekends even as they operate their startup Inkling at full speed ahead. (Yesterday Adam blogged about the confab.)

Our Yootles currency seems to have caught the public’s imagination more than any of the other various topics I covered in my own talk. (What’s wrong with you folks? You’re not endlessly fascinated with the gory mathematical details of my dynamic parimutuel market mechanism? ;-)) And so a meme is born. The lead on the Yootles project is Daniel Reeves and he is eager to answer questions and hear your feedback.

I enjoyed the confab immensely and it was great to meet so many people: thanks for the kind words from so many of you. Thanks again to the speakers, organizers, media, and attendees. I hope the event was valuable to you. Archive video of the event is available [100k|300k] for those who could not attend in person.

events, prediction markets, yahoo

confab.yahoo update

December 11, 2006 David Pennock 1 Comment

Here is an update on the confab.yahoo on prediction markets happening this Wed Dec 13 at 5:30pm at Yahoo!’s Sunnyvale headquarters, Building C, Classroom 5.

We’ve added Stanford b-school professor Eric Zitzewitz as a speaker
We’ll hold an ad-hoc vendor session immediately following the event, tentatively featuring Collective Intellect, HedgeStreet, HSX, Inkling Markets, NewsFutures, and RIMDEX
There will be food!
We’ll be giving away a limited number of copies of Surowiecki’s book The Wisdom of Crowds
We’re planning to webcast the event at two connection speeds: 100k | 300k

Again, the event is free and open to the public. Hope to see you there!

fun, yahoo

The Jelly Manifesto

December 11, 2006 David Pennock 3 Comments

Jelly is sticky. Spread it everywhere and see where users stick.

Oddhead Blog

Monthly Archives: December 2006

Musings of a computer scientist on predictions, odds, and markets