Category Archives: computer science

Where to find the Yahoo!-Google letter to the CFTC about prediction markets

At the Prediction Markets Summit [1] last Friday, April 24, 2009, I mentioned that Yahoo! and Google jointly wrote a letter to the U.S. Commodity Futures Trading Commission encouraging the legalization of small-stakes real-money prediction markets, and that Microsoft had recently written its own letter in support of the effort. (The CFTC maintains a list of all public comments responding to its request for advice on regulating prediction markets.)

I told the audience that they could learn more by searching for “cftc yahoo google” in their favorite search engine, showing the Yahoo! Search results with MidasOracle’s coverage at the top. [2]

It turns out that was poor advice. 63.7% of the audience probably won’t find what they’re looking for using that search. [3]


Yahoo! versus Google search for "cftc yahoo google"

If some search engines don’t surface the MidasOracle post, I’m hoping they’ll find this.

And back to the effort to guide the CFTC: I hope other people and companies will join. The CFTC’s request for help itself displays a clear understanding of the science and practice of prediction markets and a real willingness to listen. The more organizations that speak out in support, the greater chance we have of convincing the CFTC to take action and open the door to innovation and experimentation.

[1] Which I hesitated to attend and host a reception for, and now regret endorsing in any way.
[2] In September 2008, journalist Chris Masse uncovered the letter on the CFTC website before Google or Yahoo! had announced it. We should have known: Masse is extraordinarily skilled at finding anything relevant anywhere, and has been a tireless, invaluable (and unpaid) chronicler of all-things-prediction-markets for years now.
[3] Even Microsoft Live has the “right” result in position 3. Interestingly, Daniel Reeves got slightly different, presumably personalized, results in Google, giving it even less excuse for not knowing what two MO junkies were looking for with that query.

Data-driven Dukie

“The No-Stats All-Star” is an entertaining, fascinating, and — warning — extremely long article by Michael Lewis in the New York Times Magazine on Shane Battier, a National Basketball Association player and Duke alumnus whose intellectual and data-driven play fits perfectly into the Houston Rockets’ new emphasis on statistical modeling.

For Battier, every action is a numbers game, an attempt to maximize the probability of a good outcome. Any single outcome, good or bad, cannot be judged in isolation, as much as human nature desires it. Actions and outcomes have to be evaluated in aggregate.

Michael Lewis is a fantastic writer. Battier is an impressive player and an impressive person. Houston is not the first and certainly not the last sports team to turn to data as the arbiter of truth. This approach is destined to spread throughout industry and life, mostly because it’s right. (Yes, even for choosing shades of blue.)

Pricing the cloud, circa 1968

This article (membership required) is remarkable mostly for the fact that it was published in 1968. (Hat tip to Jonathan Smith.) It describes an experiment in creating an artificial economy to buy and sell computer time in the cloud, an idea that has been kicked around a number of times in the intervening decades but never quite took hold, until recently if you count EC2’s literal pricing in dollars. The concept of buying time on your company’s compute cluster in a pseudo-currency may come back into vogue as such installations become commonplace and oversubscribed.

Also check out the hand-drawn figure and the advertisement at the end:


COBOL extensions to handle data bases

Yahoo! Key Scientific Challenges student seed program

Yahoo! Research just published its list of key scientific challenges facing the Internet industry.

It’s a great resource for students to learn about the area and find meaty research problems. There’s also a chance for graduate students to earn $5000 in seed funding, work with Yahoo! Research scientists and data, and attend a summit of like-minded students and scientists.

The challenges cover search, machine learning, data management, information extraction, economics, social science, statistics, multimedia, and computational advertising.

Here’s the list of challenges from the algorithmic economics group, my group. We hope it provides a clear picture of the goals of our group and the areas where progress is most needed.

We look forward to supporting students who love a challenge and would like to join us in building the next-generation Internet.


Yahoo! Key Scientific Challenges Program 2009

Intelligent blog spam

As I alluded to previously, I seem to be getting “intelligent spam” on my blog: comments that pass the reCAPTCHA test and seem on-topic, yet upon further inspection clearly constitute link spam: either the author URI or a link in the comment body is spam.

Here is one of the clearest cases, received on January 9 as a comment to my post on the CFTC’s call for proposals to regulate prediction markets:

Date: Fri, 9 Jan 2009 01:28:01 -0800
From: Matt.Herdy
New comment on your post #71 “A historic MayDay: The US
government’s call for help on regulating prediction markets”
Author : Matt.Herdy
Comment:
Thanks for that post. I’ll put a note in the post.

1. It’s nothing new. The CFTC will just formalize the current
status quo.
2. We are prisoner of the CFTC regulations and the US Congress’
distaste of sports “gambling”. As for the profitability of prediction
exchanges in that strict environment, I don’t see how you can deny that
HedgeStreet went bankrupt even though it was well funded. Isn’t that a
hard fact?
3. You’re right, but all “pragmatists” should follow a business
plan and make profits. See point #2. Pragmatists won’t make miracles.

<a href="http://www.stretch-marks-help.com/">Removing stretch marks</a>

At first blush, the comments seem to come from a knowledgeable person: they refer to HedgeStreet, an extremely relevant yet mostly unknown company that’s not mentioned anywhere else in the post or other comments.

It turns out the comments seem intelligent because they are. In fact, they’re copied word for word from Chris Masse’s comments on his own blog.

Chris Masse’s page has a link to my page, so it could have been discovered with a “link:” query to a search engine.

Though I now understand what this spammer did, I remain puzzled as to exactly how they did it, and especially why.

  1. Are these comments being inserted by people, perhaps hired on Mechanical Turk or some underground equivalent? Or are they coming from robots that have either broken reCAPTCHA or the security of my blog? (John suspects a security breach.)
  2. Is it really worth it economically? All links in blog comments are NOFOLLOW links anyway, and disregarded by search engines for ranking purposes, so what is the point? Are they looking for actual humans to click these links?

In any case, it seems an intriguing development in the spam arms race. Are other bloggers getting “intelligent spam”? Does anyone know how it’s done and why?

Update 2010/07: Oh, the irony. I got a number of intelligent-seeming comments on this post about SEO, nofollow, the economics of spam, etc. that were… promoting spammy links. I left them up for their humor value, though I disabled the links.

The "predict flu using search" study you didn't hear about

In October, Philip Polgreen, Yiling Chen, myself, and Forrest Nelson (representing University of Iowa, Harvard, and Yahoo!) published an article in the journal Clinical Infectious Diseases titled “Using Internet Searches for Influenza Surveillance”.

The paper describes how web search engines may be used to monitor and predict flu outbreaks. We studied four years of data from Yahoo! Search together with data on flu outbreaks and flu-related deaths in the United States. All three measures rise and fall as flu season progresses and dissipates, as you might expect. The surprising and promising finding is that web searches rise first, one to three weeks before confirmed flu cases, and five weeks before flu-related deaths. Thus web searches may serve as a valuable advance indicator for health officials to spot the onset of diseases like the flu, complementary to other indicators and forecasts.
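To make the “rise first” claim concrete, here is a minimal sketch of the kind of lagged-correlation check involved (entirely synthetic data, not our actual analysis pipeline):

import numpy as np

# Synthetic weekly series for illustration only -- not the study's data.
rng = np.random.default_rng(0)
weeks = np.arange(104)  # two seasons of weekly observations
flu_cases = np.maximum(np.sin(2 * np.pi * weeks / 52), 0) ** 3 + 0.05 * rng.random(104)
searches = np.roll(flu_cases, -2)  # by construction, searches lead cases by 2 weeks

def best_lead(searches, cases, max_lag=6):
    """Return the lag (in weeks) at which search volume best predicts cases."""
    def corr(lag):
        # Correlate searches at week t with cases at week t + lag.
        return np.corrcoef(searches[: len(searches) - lag], cases[lag:])[0, 1]
    return max(range(max_lag + 1), key=corr)

print(best_lead(searches, flu_cases))  # 2 -- searches rise ~2 weeks earlier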

On November 11, the New York Times broke a story about Google Flu Trends, along with an unusual announcement of a pending publication in the journal Nature.

I haven’t read the paper, but the article hints at nearly identical results:

Google … dug into its database, extracted five years of data on those queries and mapped it onto the C.D.C.’s reports of influenzalike illness. Google found a strong correlation between its data and the reports from the agency…

Tests of the new Web tool … suggest that it may be able to detect regional outbreaks of the flu a week to 10 days before they are reported by the Centers for Disease Control and Prevention.

To the reporter’s credit, he interviewed Philip, and the article does mention our work in passing, though I can’t say I’m thrilled with the way it was framed:

The premise behind Google Flu Trends … has been validated by an unrelated study indicating that the data collected by Yahoo … can also help with early detection of the flu.

giving (grudging) credit to Yahoo! data rather than Yahoo! people.

The story slashdigged around the blogomediasphere quickly and thoroughly, at one point reaching #1 on the nytimes.com most-emailed list. Articles and comments praise how novel, innovative, and outside-of-the-box the idea is. The editor in chief of Nature praised the “exceptional public health implications of [the Google] paper.”

I’m thrilled to see the attention given to the topic, and the Google team deserves a huge amount of credit, especially for launching a live web site as a companion to their publication, a fantastic service of great social value. That’s an idea we had but did not pursue.

In the business world, being first often means little. However in the world of science, being first means a great deal and can be the determining factor in whether a study gets published. The truth is, although the efforts were independent, ours was published first — and Clinical Infectious Diseases scooped Nature — a decent consolation prize amid the go-google din.

Update 2008/11/24: We spoke with the Google authors and the Nature editors and our paper is cited in the Google paper, which is now published, and given fair treatment in the associated Nature News item. One nice aspect of the Google study is that they identified relevant search terms automatically by regressing all of the 50 million most frequent search queries against the CDC flu data. Congratulations and many thanks to the Google/CDC authors and the Nature editors, and thanks everyone for your comments and encouragement.
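If I understand the description, that selection step amounts to scoring every candidate query by how well its weekly volume tracks the CDC series and keeping the best performers. A minimal sketch of the idea (my own simplification, not the paper’s actual methodology):

import numpy as np

def top_flu_queries(query_counts, cdc_ili, k=45):
    """Rank candidate queries by correlation of their weekly volume with the
    CDC influenza-like-illness (ILI) series; keep the top k (k illustrative).

    query_counts: dict mapping query string -> array of weekly volumes
    cdc_ili: array of weekly ILI rates, aligned to the same weeks
    """
    scored = [(np.corrcoef(counts, cdc_ili)[0, 1], q)
              for q, counts in query_counts.items()]
    scored.sort(reverse=True)
    return [q for _, q in scored[:k]]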

WeatherBill shows the way toward usable combinatorial prediction markets

WeatherBill lets you construct an enormous variety of insurance contracts related to weather. For example, the screenshot embedded below shows how I might have insured my vacation at the New Jersey shore:

WeatherBill Example Contract

For $42.62 I could have arranged to be paid $100 per day of rain during my vacation.

(I didn’t actually purchase this mainly because the US government insists that I am a menace to myself and should not be allowed to enter into such a dangerous gamble — more on this later. And as Dan Reeves pointed out to me, it’s probably not rational to do for small sums.)

WeatherBill is an example of the evolution of financial exchanges as they embrace technology.

WeatherBill can be thought of as expressive insurance, a financial category no doubt poised for growth and a wonderful example of how computer science algorithms are finally supplanting the centuries-old exchange logic designed for humans (CombineNet is another great example).

WeatherBill can also be thought of as a combinatorial prediction market with an automated market maker, a viewpoint I’ll expand on now.

On WeatherBill, you piece together contracts by specifying a series of attributes: date range, place, type of weather, threshold temperature or precipitation level, minimum and maximum number of bad-weather days, etc. The user interface is extremely well done: a straightforward series of adaptive menu choices and text entry fields guides the customer through the selection process.

This flexibility quickly leads to a combinatorial explosion: given the choices on the site I’m sure the number of possible contracts you can construct runs into the millions.
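To see how quickly the menus multiply, here is a back-of-the-envelope count with made-up attribute domains (not WeatherBill’s actual menus):

from math import prod

# Illustrative attribute domains -- not WeatherBill's actual menus.
choices = {
    "location": 100,            # weather stations
    "start date": 365,
    "duration in days": 14,
    "weather measurement": 4,   # e.g., rain, snow, temperature, wind
    "threshold level": 10,
    "min/max bad-weather days": 10,
}
print(prod(choices.values()))   # 204,400,000 -- "millions" is conservative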

Once you’ve defined when you want to be paid — according to whatever definition of bad weather makes sense for you or your business — you choose how much you want to be paid.

Finally, given all this information, WeatherBill quotes a price for your custom insurance contract, in effect the maximum amount you will lose if bad weather doesn’t materialize. Quotes are instantaneous — essentially WeatherBill is an automated market maker always willing to trade at some price on any of millions of contracts.

Side note: On WeatherBill, you control the magnitude of your bet by choosing how much you want to be paid. In a typical prediction market, you control magnitude by choosing how many shares to trade. In our own prediction market Yoopick, you control magnitude by choosing the maximum amount you are willing to lose. All three approaches are equivalent, and what’s best depends on context. I would argue that the WeatherBill and Yoopick approaches are simpler to understand, requiring less indirection. The WeatherBill approach seems most natural in an insurance context and the Yoopick approach in a gambling context.
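The equivalence is simple arithmetic. For a contract paying $1 per share at price p, here is a minimal sketch in my own notation (loosely modeled on the rain contract above, not any site’s actual API):

def from_shares(n, p):
    """Buy n shares at price p each; each share pays $1 if the event occurs."""
    return {"shares": n, "payout if event": n, "max loss": n * p}

def from_payout(payout, p):
    """WeatherBill-style sizing: choose the payout, derive the rest."""
    return from_shares(payout, p)        # each share pays $1

def from_max_loss(loss, p):
    """Yoopick-style sizing: choose the most you're willing to lose."""
    return from_shares(loss / p, p)

p = 0.4262                               # price per $1 of payout
print(from_payout(100, p))               # max loss: ~$42.62
print(from_max_loss(42.62, p))           # payout: ~$100.00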

How does the WeatherBill market maker determine prices? I don’t know the details, but their FAQ says that prices change “due to a number of factors, including WeatherBill forecast data, weather simulation, and recent Contract sales”. Certainly historical data plays an important role — in fact, with every price quote WeatherBill tells you what you would have been paid in years past. They allow contracts as few as four days into the future, so I imagine they incorporate current weather forecasts. And the FAQ implies that some form of market feedback occurs, raising prices on contract terms that are in high demand.

Interface is important. WeatherBill shows that a very complicated combinatorial market can be presented in a natural and intuitive way. Though greater expressiveness can mean greater complexity and confusion, Tuomas Sandholm is fond of pointing out that, when done right, expressiveness actually simplifies things by allowing users to speak in terms they are familiar with. WeatherBill — and to an extent Yoopick IMHO — are examples of this somewhat counterintuitive principle at work.

There is another quote from WeatherBill’s FAQ that alludes to an even higher degree of combinatorics coming soon:

Currently you can only price contracts based on one weather measurement. We’re working on making it possible to use more than one measurement, and hope to make it available soon.

If so, I can imagine the number of possible insurance contracts quickly growing into the billions or more with prices hinging on interdependencies among weather events.

Finally, back to the US government treating me like a child. It turns out that only a very limited set of people can buy contracts on WeatherBill, mainly businesses and multi-millionaires who aren’t speculators. In fact, the rules governing who can play are a convoluted jumble that I believe is based on regulations from the US Commodity Futures Trading Commission.

Luckily, WeatherBill provides a nice “choose your own adventure” style navigation flow to determine whether you are allowed to participate. Most people will quickly find they are not eligible. (I don’t officially endorse the CYOA standard of re-starting over and over again until you pass.)

Even if red tape locks the average consumer out of direct access, clever companies are stepping in to mediate. In a nice intro piece on WeatherBill, Newsweek mentions that Priceline used WeatherBill to back a “Sunshine Guaranteed” promotion offering refunds to customers whose trips were rained out.

Can you think of other end-arounds to bring WeatherBill functionality to the masses? What other forms of expressive insurance would you like to see?

The seedy side of Amazon's Mechanical Turk

I mostly side with Lukas and Panos on the fantastic potential of Amazon’s Mechanical Turk, a crowdsourcing service specializing in tiny payments for simple tasks that require human brainpower, like labeling images. Within the field of computer science alone, this type of service will revolutionize how empirical research is done in communities from CHI to SIGIR, powering unprecedented speed and scale at low cost (here are two examples). My guess is that the impact will be even larger in the social sciences; already, a number of folks in Yahoo’s Social Dynamics research group have started running studies on mturk. (A side question is how university review boards will react.)

However there is a seedier side to mturk, and I’m of two minds about it. Some people use the service to hire sockpuppets to enter bogus ratings and reviews about their products and engage in other forms of spam. (Actually this appears to violate mturk’s stated policies.)

For example, Samuel Deskin is offering up to ten cents to turkers willing to promote his new personalized start page samfind.

EARN TEN CENTS WITH THE BONUS – EASY MONEY – JUST VOTE FOR US AND COMMENT ABOUT US

EARN FOUR CENTS IF YOU:

1. Set up an anoymous email account likke gmail or yahoo so you can register on #2 anonymously

2. Visit http://thesearchrace.com/signup.php and sign up for an account – using your anonymous email account.

3. Visit http://www.thesearchrace.com/recent.php and vote for:

samfind

By clcking “Pick”

SIX CENTS BONUS:

4. Visit the COMMENTS Page on The Search Race, it is the Button Right Next to “Picks” on this page: http://www.thesearchrace.com/recent.php and

5. Say something awesome about samfind (http://samfind.com) on The Search Race’s Comments page.

Make sure to:

1. Tell us that you Picked us.
2. Copy and Paste the Comment you typed on The Search Race’s Comment page here so we know you wrote it and we will give you the bonus!

In fact, Deskin is currently offering bounties on mturk for a number of different spammy activities to promote his site. On the other hand, what Deskin is doing is not illegal and is arguably not all that different from paying PRWEB to publish his rah-rah press release (Start-up, samfind, Launches Customizable Startpage to Compete with Google, Yahoo & MSN, Los Angeles, California (PRWEB) August 4, 2008). And I have to at least give him credit for offering the money under his own name.

Another type of task on mturk involves taking a piece of text and paraphrasing it so that the words are different but the meaning remains the same. Here is an example:

Paraphrase This Paragraph

Here’s the original paragraph:

You’re probably wondering how to apply a wrinkle filler to your skin. The good news is that it’s easy! There are a number of different products on the market for anti aging skin care. Each one comes with its own special application instructions, which you should always make sure to read and carefully follow. In general, however, most anti aging skin care products are simply applied to the skin and left to soak in.

Requirements:
1. Use the same writing style as much as possible.
2. Vary at least 50% of the words and phrases – but keep the same concepts. Use obviously different sentences! Your paragraph should not be just a copy of the first with a few word replacements.
3. Any keywords listed in bold in the above paragraph must be included in your paraphrase.
4. The above paragraph contains 75 words… yours must contain at least 64 words and not more than 101 words.
5. Write using American English.
6. No obvious spelling or grammar mistakes. Please use a spell-checker before submitting. A free online spell checker can be found at www.spellcheck.net.

If you find it easier to paraphrase sentence-by-sentence, then do that. Please do not enter anything in the textbox other than your written paragraph. Thanks!

I have no direct evidence, but I imagine such a task is used to create splogs (I once found what seems like such a “paraphrasing splog”), ad traps, email spam, or other plagiarized content.

It’s possible that paid spam is hitting my blog (either that or I’m overly paranoid). I’m beginning to receive comments that are almost surely coming from humans, both because they clearly reference the content of the post and because they pass the reCAPTCHA test. However, the author’s URL seems to point to an ad trap. I wonder if these commenters (who are particularly hard to catch — you have to bother to click on the author URL) are paid workers of some crowdsourcing service?

Can and should Amazon try to filter away these kinds of dubious uses of Mechanical Turk? Or is it better to have this inevitable form of economic activity out in the open? One could argue that at least systems like mturk impose a tax on pollution and spam, a mechanism long argued to be an economic force for reducing spam.

My main objection to these activities is the lack of disclosure. Advertisements and press releases are paid for, but everyone knows it, and usually the funding source is known. However, the ratings, reviews, and paraphrased text coming out of mturk masquerade as authentic opinions and original content. I absolutely want mturk to succeed — it’s an innovative service of tremendous value, one of many to come out of Amazon recently — but I believe Amazon is risking a minor PR backlash by allowing these activities to flow through its servers and by profiting from them.

Reporting prediction market prices

Reuters recently ran a story on political prediction markets, quoting prices from intrade and IEM. (Apparently the story was buzzed up to the Yahoo! homepage and made the Drudge Report.)

The reporter phrased prices in terms of the candidates’ percent chance of winning:

Traders … gave Democratic front-runner Barack Obama an 86 percent chance of being the Democratic presidential nominee, versus a 12.8 percent for Clinton…

…traders were betting the Democratic nominee would ultimately become president. They gave the Democrat a 59.1 percent chance of winning, versus a 48.8 percent chance for the Republican.

The latter numbers imply an embarrassingly incoherent market, giving the Democrats and Republicans together a 107.9 percent chance of winning. This is almost certainly the result of a typo, since the Republican candidate on intrade has not traded much above 40 since mid-2007.

Still, typos aside, we know that the last-trade prices of candidates on intrade and IEM often don’t sum to exactly 100. So how should journalists report prediction market prices?

Byrne Hobart suggests they should stick to something strictly factual like "For $4.00, an investor could purchase a contract which would yield $10.00" if the Republican wins.

I disagree. I believe that phrasing prices as probabilities is desirable. The general public understands “percent chance” without further explanation, and interpreting prices in this way directly aligns with the prediction market industry’s message.

When converting prices to probabilities, is a journalist obligated to normalize them so they sum to 100? Should journalists report last-trade prices or bid-ask spreads or something else?
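On the first question: if a journalist does normalize, the arithmetic is trivial. Applied to the (presumably typo-ridden) Reuters numbers above:

def normalize(prices):
    """Rescale a set of complementary prices so they sum to 100."""
    total = sum(prices)
    return [round(100 * p / total, 1) for p in prices]

print(normalize([59.1, 48.8]))   # [54.8, 45.2] -- a coherent 100 percent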

My inclination is that bid-ask spreads are better. Something like "traders gave the Democrats between a 22 and 30 percent chance of winning the state of Arkansas". These will rarely be inconsistent (otherwise arbitrage is sitting on the table) and the phrasing is still relatively easy to understand.
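The reason quoted spreads are rarely inconsistent: if the asks across an exhaustive set of outcomes summed to less than 100, or the bids to more than 100, anyone could lock in a riskless profit. A quick sanity check in the same spirit (my own illustration):

def check_arbitrage(quotes):
    """quotes: one (bid, ask) pair per outcome, in percent-chance units,
    for an exhaustive set of mutually exclusive outcomes."""
    bid_total = sum(bid for bid, _ in quotes)
    ask_total = sum(ask for _, ask in quotes)
    if ask_total < 100:
        return f"buy everything: pay {ask_total}, collect a guaranteed 100"
    if bid_total > 100:
        return f"sell everything: collect {bid_total}, owe at most 100"
    return "no arbitrage"

print(check_arbitrage([(22, 30), (68, 74)]))   # no arbitrage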

Avoiding this (admittedly nitpicky) dilemma is another advantage of automated market makers like Hanson’s. The market maker’s prices always sum to exactly 100, and the bid, ask, and last-trade prices are one and the same. Auction-type mechanisms like intrade’s can also be designed better so that prices are automatically kept consistent.
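For the curious, here is a minimal sketch of the price function in Hanson’s logarithmic market scoring rule (LMSR), the standard such market maker; its prices sum to exactly 1 by construction:

import math

def lmsr_prices(q, b=100.0):
    """Instantaneous LMSR prices p_i = exp(q_i / b) / sum_j exp(q_j / b),
    where q_i is the number of outstanding shares on outcome i and b is
    the market maker's liquidity parameter."""
    exps = [math.exp(qi / b) for qi in q]
    total = sum(exps)
    return [e / total for e in exps]

prices = lmsr_prices([120.0, 80.0])
print(prices, sum(prices))   # prices sum to 1 (up to float rounding)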

A freakonomist takes on Big Weather and … stumbles

It seems that even D.I.Y. freakonomists aren’t sure how to judge probability forecasts.

In measuring precipitation accuracy, the study assumed that if a forecaster predicted a 50 percent or higher chance of precipitation, they were saying it was more likely to rain than not. Less than 50 percent meant it was more likely to not rain.

That prediction was then compared to whether or not it actually did rain…
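For context, the standard way to judge probability forecasts without collapsing them to a yes/no cutoff is a proper scoring rule such as the Brier score, which rewards calibration at every probability level. A minimal sketch (my illustration, not the study’s method):

def brier_score(forecasts, outcomes):
    """Mean squared error between forecast rain probabilities and what
    actually happened (1 = rain, 0 = no rain). Lower is better."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Two forecasts on a dry day: 10% and 45% chance of rain. A 50% cutoff calls
# both "no rain" and scores them identically; the Brier score does not.
print(brier_score([0.10], [0]), brier_score([0.45], [0]))   # 0.01 0.2025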