The "predict flu using search" study you didn't hear about

In October, Philip Polgreen, Yiling Chen, myself, and Forrest Nelson (representing the University of Iowa, Harvard, and Yahoo!) published an article in the journal Clinical Infectious Diseases titled “Using Internet Searches for Influenza Surveillance”.

The paper describes how web search engines may be used to monitor and predict flu outbreaks. We studied four years of data from Yahoo! Search together with data on flu outbreaks and flu-related deaths in the United States. All three measures rise and fall as flu season progresses and dissipates, as you might expect. The surprising and promising finding is that web searches rise first, one to three weeks before confirmed flu cases, and five weeks before flu-related deaths. Thus web searches may serve as a valuable advance indicator for health officials to spot the onset of diseases like the flu, complementary to other indicators and forecasts.
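As a rough illustration of what "searches rise first" means, here is a minimal sketch of a lead-lag check. It is not the analysis from the paper: the weekly counts are made up, and the helper names `pearson` and `best_lead` are hypothetical. The idea is to shift the search series forward by 0, 1, 2, ... weeks and see which shift lines up best with the case series.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def best_lead(searches, cases, max_lead=6):
    """Return (lead, r): the lead in weeks at which searches best match cases."""
    results = []
    for lead in range(max_lead + 1):
        # Pair searches in week t with cases in week t + lead.
        xs = searches[: len(searches) - lead]
        ys = cases[lead:]
        results.append((lead, pearson(xs, ys)))
    return max(results, key=lambda pair: pair[1])

# Made-up weekly counts: the case curve is the search curve delayed by two weeks.
searches = [1, 2, 4, 8, 12, 9, 5, 3, 2, 1, 1, 1]
cases = [1, 1] + searches[:-2]

lead, r = best_lead(searches, cases)
print(f"searches lead cases by {lead} week(s), r = {r:.2f}")
```

With real data the peak would be far less clean, and one would want to check several flu seasons before trusting any particular lead time.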

On November 11, the New York Times broke a story about Google Flu Trends, along with an unusual announcement of a pending publication in the journal Nature.

I haven’t read the paper, but the article hints at nearly identical results:

Google … dug into its database, extracted five years of data on those queries and mapped it onto the C.D.C.’s reports of influenzalike illness. Google found a strong correlation between its data and the reports from the agency…

Tests of the new Web tool … suggest that it may be able to detect regional outbreaks of the flu a week to 10 days before they are reported by the Centers for Disease Control and Prevention.

To the reporter’s credit, he interviewed Philip and the article does mention our work in passing, though I can’t say I’m thrilled with the way it was framed:

The premise behind Google Flu Trends … has been validated by an unrelated study indicating that the data collected by Yahoo … can also help with early detection of the flu.

giving (grudging) credit to Yahoo! data rather than Yahoo! people.

The story slashdigged around the blogomediasphere quickly and thoroughly, at one point reaching #1 on the nytimes.com most-emailed list. Articles and comments praise how novel, innovative, and outside-of-the-box the idea is. The editor in chief of Nature praised the “exceptional public health implications of [the Google] paper.”

I’m thrilled to see the attention given to the topic, and the Google team deserves a huge amount of credit, especially for launching a live web site as a companion to their publication, a fantastic service of great social value. That’s an idea we had but did not pursue.

In the business world, being first often means little. However, in the world of science, being first means a great deal and can be the determining factor in whether a study gets published. The truth is, although the efforts were independent, ours was published first (and Clinical Infectious Diseases scooped Nature), a decent consolation prize amid the go-google din.

Update 2008/11/24: We spoke with the Google authors and the Nature editors and our paper is cited in the Google paper, which is now published, and given fair treatment in the associated Nature News item. One nice aspect of the Google study is that they identified relevant search terms automatically by regressing all of the 50 million most frequent search queries against the CDC flu data. Congratulations and many thanks to the Google/CDC authors and the Nature editors, and thanks everyone for your comments and encouragement.
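To give a flavor of what automatic term selection can look like, here is a toy sketch that scores each candidate query by how well its weekly volume correlates with a flu series and keeps the top few. This is only my simplified illustration, not the procedure from the Google paper: the query strings, the counts, and the helper names `pearson` and `rank_queries` are all made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_queries(query_series, flu_series, top_k=2):
    """Return the top_k queries whose weekly volume best tracks flu_series."""
    scored = [(pearson(series, flu_series), query)
              for query, series in query_series.items()]
    scored.sort(reverse=True)  # highest correlation first
    return [query for _, query in scored[:top_k]]

# Made-up weekly flu counts and made-up weekly volumes for three queries.
flu = [1, 2, 5, 9, 6, 3, 1, 1]
queries = {
    "flu symptoms": [2, 4, 10, 18, 12, 6, 2, 2],
    "fever remedy": [1, 3, 5, 8, 7, 3, 2, 1],
    "movie times": [5, 5, 6, 5, 5, 6, 5, 5],
}

top = rank_queries(queries, flu)
print(top)
```

At the scale of tens of millions of queries, some spurious matches are inevitable (anything seasonal will correlate with flu), so a real system would need held-out validation on top of a ranking like this.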

12 thoughts on “The "predict flu using search" study you didn't hear about”

  1. Hi, David. It is unfortunate that the press did not give sufficient credit to prior work on this. The underlying idea is clever and powerful.

    However, I think it is fair to interpret the press attention as being due to the live web site, not the Nature article.

    Google made the data publicly accessible in a fun and interesting way that reporters can point to and people can try out. I think that is why the press and public reacted with such enthusiasm.

    As we all know, there is a big difference between discussing an idea in the research community and writing a tool based on that idea that attracts the interest of a mass market audience. I think the live web site was not just a companion to their publication, but the reason for the attention.

  2. Hey David, I feel your pain. Popular bloggers and media frequently ignore the root development of an idea in exchange for “fun and interesting ways” to write something about Google. And, Google folks seem to develop a lot of projects without giving attribution to original work.

    What’s easy to miss here is that without correct attribution we decrease our ability to correctly allocate resources to important research. For example, as flu epidemiology continues to increase in importance I would prefer grants to go to serious researchers rather than ad-hoc “fun and interesting” sites already funded by the ad dollars of the masses.

  3. Greg: Thanks for the comment. I agree with you on almost everything: the live site represents a great deal of effort (orders of magnitude more than the publication), provides significant social value, and is likely the main source of enthusiasm. Still, much of the praise has to do with the novelty of the idea, and that is what I take issue with. I believe a peer-reviewed publication represents something more tangible than “discussing an idea in the research community”, though less tangible than an implemented system. I’m mostly referring to the Nature paper, not the Flu Trends site, which I agree is great and worthy of praise. Finally, I may be jaded, but I believe if Yahoo!, Microsoft, or AOL had launched Flu Trends, the reception would have been less enthusiastic.

    Gordon, thanks very much for the comment: agreed.

  4. That’s a good point, David, that the press is praising Google for the novelty of the idea. I think we agree that the praise is more appropriate for the novelty of the implemented system, not the idea.

    Ironically, the reverse situation happened to me in the past. The algorithm for Amazon’s recommendations was implemented, used by many, and disclosed publicly (in a patent). But, a research paper that came well after anyone could read about the technique we used is often cited in the research community as the first instance of the idea.

    We attempted to correct this by later publishing an academic article that contained much of the same material as our earlier disclosure, this time in a forum directed toward the research community, but the other paper is still often cited by academic researchers as the first instance.

    In any case, thanks for setting the record straight with your reference to your prior work on the topic. It is a fun and remarkably powerful idea, detecting flu outbreaks using search activity, and the general category of using online behavior trends to predict offline trends is fascinating and shows much promise.

  5. I think this is another example of the problems behind the long lag times in journal publication.

    The timescale quoted on the Clinical Infectious Diseases website for your paper is, “Received 8 May 2008; accepted 11 August 2008; electronically published 27 October 2008.” You may have done this research a while back but the first Google could have heard about this would have been just over 2 weeks ago. We can’t see the timescales for the Google Nature submission yet but my guess is that it could well have been with Nature before your publication.

    It’s a pity that Google got all the press coverage but they’re probably blameless and a victim of the secrecy in journals’ pre-press periods.

  6. Greg: thanks: yes, credit assignment is hard, and even harder to correct after the fact. Probably in the grand scheme I should worry less about credit than progress. Usually I find myself in the opposite situation where an idea I have turns out to have been published previously. Almost every good idea has likely been thought of by someone somewhere.

    Edward: thanks: I agree journal lag times are a problem. In fact, the whole journal system, with its copyrights, fees, and embargoes, seems to hinder more than help, especially now that high-quality self-publishing is nearly costless. Likely the Google folks are blameless: I agree. We did present the work at a scientific conference in March 2008 (here is an obscure online reference: “The 2008 late-breaker session will cover topics that include tracking influenza through internet searches”), but I can’t fault anyone for not being aware of every related presentation or even publication out there.

    Here is another tidbit for the conspiracy theory: The same abstract was rejected from a CDC-sponsored meeting (ICEID) in November 2007. It was the only abstract rejected out of the five that Philip submitted, and CDC is a partner of the Google effort. Though it’s fun to contemplate some grand conspiracy, I’m sure it was coincidence.

  7. Now you understand exactly how my partner and I felt when GOOGLE announced GOOGLE SKY as something absolutely new and the best when SKY-MAP.ORG had already been online for more than a year! But you are at YAHOO and you at least can make your voice louder.

  8. Thanks Kostya. Sky-Map.org (alias WikiSky.org) is fantastic. Here is my delicious.com quote:

    “Crazy cool and comprehensive map of the universe: virtual view of the night sky: zoom in to incredible depth anywhere and look at high resolution photos from the Hubble telescope, etc., of significant objects.”

    Keep up the great work. Thanks for your note: it helps keep things in perspective: your case is even clearer and more deserving of credit, with less ability to voice it.

  9. Postscript:

    We spoke with the Google authors and the Nature editors and our paper will be cited in the Google Nature paper. Many thanks to the Google authors and the Nature editors, and thanks everyone for your comments and encouragement.

  10. David,
    That is certainly a great and very practical idea: analysis of the correlation between the density of subject-related queries and social events related to the same subject (epidemics in your case). But I think that it may be even more interesting to analyze it from a more general perspective. I couldn’t find the text of your article (and would be grateful if you could provide it) but from http://www.journals.uchicago.edu/doi/abs/10.1086/593098 it seems that you used some function F(q)->Q (q is a query, Q belongs to R) to classify queries (to determine whether a particular query q is “influenza-related”). In this case the choice of F is critical and always questionable. Did you think about the opposite approach: determining “influenza-sensitive” classes of queries first? In my opinion that may allow more accurate predictions, because people in the very early stages of the disease may subconsciously react, generating well-classified queries that are not directly related to influenza at all. The class of functions F may be chosen in many different ways. For example, it may be a simple neural network.

    In general I think that the analysis of search-engine queries may be an extremely effective tool, very much applicable to society-control.

    Thanks.

    Another thing-
    Thank you David for your compliments – SKY-MAP is our lovely child and we will continue working on it as long as we stay ourselves. But this baby is growing rapidly and it is becoming harder for us to wear it. We are trying to find people and/or organizations that would be interested in cooperation. What we actually think: if Google has something and MS has it but Yahoo doesn’t, maybe Yahoo would be interested in having something similar? Do you know any people at Yahoo who might be interested in talking to us?

    In any case – thanks a lot,
    K. Lysenko.

  11. Kostya,

    Actually our method is even more naive: we identify influenza-related queries manually. The Google study is more sophisticated: they identify queries automatically using feature selection.

    Thanks for the note about SKY-MAP: I will circulate your proposal.
