All posts by David Pennock

Review of Fortune’s Formula by William Poundstone: The stranger-than-fiction tale of how to invest

What is a better investment objective?

  1. Grow as wealthy as possible as quickly as possible, or
  2. Maximize expected wealth for a given time period and level of risk

The question is at the heart of a fight between computer scientists and economists chronicled beautifully in the book Fortune’s Formula by Pulitzer Prize nominee William Poundstone. (See also David Pogue’s excellent review.*) From the book’s sprawling cast — Claude Shannon, Rudy Giuliani, Michael Milken, mobsters, and mob-backed companies (including what is now Time Warner!) — emerges an unlikely duel. Our hero, mathematician turned professional gambler and investor Edward Thorp, leads the computer scientists and information theorists preaching and, more importantly, practicing objective #1. Nobel laureate Paul Samuelson (who, sadly, recently passed away) serves as lead villain (and, to an extent, comic foil) among economists promoting objective #2 in often patronizing terms. The debate sank to surprising depths of immaturity, hitting bottom when Samuelson published an economist-peer-reviewed article written entirely in one-syllable words, presumably to ensure that his thrashing of objective #1 could be understood by even its nincompoop proponents.

Objective #1 — The Kelly criterion

Objective #1 is the have-your-cake-and-eat-it-too promise of the Kelly criterion, a money management formula first worked out by Bernoulli in 1738 and later rediscovered and improved by Bell Labs scientist John Kelly, who proved a direct connection between Shannon-optimal communication and optimal gambling. Objective #1 matches common sense: who wouldn’t want to maximize growth of wealth? Thorp, college professor by day and insanely successful money manager by night, is almost certainly the greatest living example of the Kelly criterion at work. His track record is hard to refute.

If two twins with equal wealth invest long enough, the Kelly twin will finish richer with 100% certainty.

The Kelly criterion dictates exactly what fraction of wealth to wager on any available gamble. First consider a binary gamble that, if correct, pays $x for every $1 risked. You estimate that the probability of winning is p. As Poundstone states it, the Kelly rule says to invest a fraction of your wealth equal to edge/odds, where edge is the expected return per $1 and odds is the payoff per $1. Substituting, edge/odds = (x*p – 1*(1-p))/x. If the expected return is zero or negative, Kelly sensibly advises to stay away: don’t invest at all. If the expected return is positive, Kelly says to invest some fraction of your wealth proportional to how advantageous the bet is. To generalize beyond a single binary bet, we can use the fact that, as it happens, the Kelly criterion is entirely equivalent to (1) maximizing the logarithm of wealth, and (2) maximizing the geometric mean of gambles.
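To make the rule concrete, here is a minimal Python sketch of the binary-bet case just described (my own illustration, not code from the book):

    def kelly_fraction(p, x):
        """Kelly fraction for a binary bet paying $x per $1 risked, won with probability p.

        edge = expected net profit per $1 staked; odds = payoff per $1 staked.
        Returns 0 when the edge is zero or negative: don't bet.
        """
        edge = p * x - (1 - p)
        odds = x
        return max(edge / odds, 0.0)

    # Example: an even-money bet (x = 1) you believe wins 55% of the time
    print(kelly_fraction(0.55, 1.0))  # ~0.10: stake 10% of your wealth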

Investing according to the Kelly criterion achieves objective #1. The strategy provably maximizes the growth rate of wealth. Stated another way, it minimizes the time it takes to reach any given aspiration level, say $1 million, or the nest egg you want for retirement. If two twins with equal initial wealth were to invest long enough, one according to Kelly and the other not, the Kelly twin would finish richer with 100% certainty.
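Here is a hypothetical version of the twin experiment (parameters of my own choosing, purely for illustration): both twins face the identical sequence of even-money bets that win 55% of the time; one stakes the Kelly fraction of 10% every round, the other over-bets half her wealth.

    import random

    def simulate(fraction, p=0.55, x=1.0, rounds=10_000, seed=0):
        """Repeatedly bet a fixed fraction of wealth on a p-probability, x-to-1 gamble."""
        rng = random.Random(seed)   # same seed, so both twins see the same coin flips
        wealth = 1.0
        for _ in range(rounds):
            stake = fraction * wealth
            wealth += stake * x if rng.random() < p else -stake
        return wealth

    print(simulate(0.10))   # Kelly twin: wealth compounds at the maximal exponential rate
    print(simulate(0.50))   # over-betting twin: wealth withers toward zero

Run it with any seed you like; after enough rounds the Kelly twin comes out ahead essentially every time.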

Objective #2

Objective #2 refers to standard economic dogma. Low-risk/high-return investments are always preferred to high-risk/low-return investments, but high-risk/high-return and low-risk/low-return are not comparable in general. Deciding between these is a personal choice, a function of the decision maker’s risk attitude. There is no optimal portfolio, only an efficient frontier of many Pareto optimal portfolios that trade off risk for return. The investor must first identify his utility function (how much he values a dollar at every level of wealth) in order to compute the best portfolio among the many valid choices. (In fact, objective #1 is a special case of #2 where utility for money is logarithmic. Deriving rather than choosing the best utility function is anathema to economists.)
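The log-utility connection is easy to check symbolically. The sketch below (my own, and it assumes the sympy library is installed) maximizes expected log-wealth for the binary bet from above and recovers exactly the edge/odds fraction:

    import sympy as sp

    f = sp.symbols("f")                      # fraction of wealth staked
    p, x = sp.symbols("p x", positive=True)  # win probability, payoff per $1
    # Expected log-growth per bet when staking a fraction f of wealth
    growth = p * sp.log(1 + f * x) + (1 - p) * sp.log(1 - f)
    f_star = sp.solve(sp.diff(growth, f), f)[0]
    print(sp.simplify(f_star))               # (p*x + p - 1)/x, i.e. edge/odds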

Objective #2 is straightforward for making one choice for a fixed time horizon. Generalizing it to continuous investment over time requires intricate forecasting and optimization (which Samuelson published in his 1969 paper “Lifetime portfolio selection by dynamic stochastic programming”, claiming to finally put to rest the Kelly investing “fallacy” — p.210). The Kelly criterion is, astonishingly, a greedy (myopic) rule that at every moment only needs to work out the current optimal portfolio. It is already, by its definition, formulated for continuous investment over time.

Details and Caveats

There is a subtle and confusing aspect to objective #1 that took me some time and coaching from Sharad and Dan to wrap my head around. Even though Kelly investing maximizes long-term wealth with 100% certainty, it does not maximize expected wealth! The proof of objective #1 is a concentration bound that appeals to the law of large numbers. Wealth is, eventually, an essentially deterministic quantity. If a billion investors played non-Kelly strategies for long enough, then their average wealth might actually be higher than a Kelly investor’s wealth, but only a few individuals out of the billion would be ahead of Kelly. So, some non-Kelly strategies do have higher expected wealth than Kelly, but they beat Kelly with probability approaching zero. Note that, while Kelly does not maximize expected (average) wealth, it does maximize median wealth (p.216) and the mode of wealth. See Chapter 6 on “Gambling and Data Compression” (especially pages 159-162) in Thomas Cover’s book Elements of Information Theory for a good introduction and concise proof.
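A small exact calculation (parameters mine, chosen only for illustration) makes the mean-versus-median distinction concrete: over 100 even-money bets won with probability 0.55, an over-bettor staking 90% of wealth each round has far higher expected wealth than a Kelly bettor staking 10%, yet finishes ahead of the Kelly bettor only if she wins more than about 80% of the rounds, an event of vanishing probability.

    from math import comb

    p, x, t = 0.55, 1.0, 100           # win probability, payoff per $1, number of rounds
    f_kelly, f_aggressive = 0.10, 0.90

    def wealth(f, k):
        """Wealth after k wins and t - k losses, starting from $1 and staking fraction f."""
        return (1 + f * x) ** k * (1 - f) ** (t - k)

    def pmf(k):
        """Probability of exactly k wins in t independent rounds."""
        return comb(t, k) * p ** k * (1 - p) ** (t - k)

    exp_kelly = sum(pmf(k) * wealth(f_kelly, k) for k in range(t + 1))
    exp_aggressive = sum(pmf(k) * wealth(f_aggressive, k) for k in range(t + 1))
    pr_ahead = sum(pmf(k) for k in range(t + 1)
                   if wealth(f_aggressive, k) > wealth(f_kelly, k))

    print(f"expected wealth: Kelly {exp_kelly:.1f}, over-bettor {exp_aggressive:.1f}")
    print(f"Pr[over-bettor finishes ahead of Kelly]: {pr_ahead:.1e}")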

Objective #1 does have important caveats, leading to legitimate arguments against pure Kelly investing. First, it’s often too aggressive. Sure, Kelly guarantees you’ll come out ahead, but only if investing for “long enough”, a necessarily vague phrase that could mean, well, infinitely long. (In fact, a pure Kelly investor at any time has a 1 in n chance of losing all but 1/n of their wealth — p.229) The guarantee also only applies if your estimate of expected return per dollar is accurate, a dubious assumption. So, people often practice what is called fractional Kelly, or investing half or less of whatever the Kelly criterion says to invest. This admittedly starts down a slippery slope from objective #1 to objective #2, leaving the mathematical high ground of optimality to account for people’s distaste for risk. And, unlike objective #2, fractional Kelly does so in a non-principled way.
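The drawdown risk that motivates fractional Kelly shows up readily in simulation. In the Monte Carlo sketch below (my own toy numbers again), full Kelly on the 55% even-money bet dips below 20% of starting wealth on roughly one path in five, consistent with the 1-in-n rule of thumb for n = 5, while half Kelly almost never does, at the cost of a slower median growth rate:

    import random

    def run(fraction, p=0.55, x=1.0, rounds=1_000, seed=None):
        """Return (final wealth, lowest wealth reached) betting a fixed fraction each round."""
        rng = random.Random(seed)
        wealth, low = 1.0, 1.0
        for _ in range(rounds):
            stake = fraction * wealth
            wealth += stake * x if rng.random() < p else -stake
            low = min(low, wealth)
        return wealth, low

    for label, f in [("full Kelly", 0.10), ("half Kelly", 0.05)]:
        finals, lows = zip(*(run(f, seed=s) for s in range(2_000)))
        drawdown = sum(low < 0.2 for low in lows) / len(lows)
        median = sorted(finals)[len(finals) // 2]
        print(f"{label}: median final wealth {median:.0f}, "
              f"Pr[ever below 20% of start] ~ {drawdown:.2f}")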

Even as Kelly investing is in some ways too aggressive, it is also too conservative, equating bankruptcy with death. A Kelly strategy will never risk even the most minuscule (measure zero) probability of losing all wealth. For one thing, the very notion that each person’s wealth equals some precise number is inexact at best. People hold wealth in different forms and have access to credit of many types. Gamblers often apply Kelly to an arbitrary “casino budget” even though they’re an ATM away from replenishment. People can recover nicely from even multiple bankruptcies (see Donald Trump).

Some Conjectures

Objective #2 captures a fundamental trade off between expected return and variance of return. Objective #1 seems to capture a slightly different trade off, between expected return and probability of loss. Kelly investing walks the fine line between increasing expected return and reducing the long-run probability of falling below any threshold (say, below where you started). There are strategies with higher expected return but they end in ruin with 100% certainty. There are strategies with lower probability of loss but that grow wealth more slowly. In some sense, Kelly gets the highest expected return possible under the most minimal constraint: that the probability of catastrophic loss is not 100%. [Update 2010/09/09: The statements above are not correct, as pointed out to me by Lirong Xia. Some non-Kelly strategies can have higher expected return than Kelly and near-zero probability of ruin. But they will do worse than Kelly with probability approaching 1.]

It may be that the Kelly criterion can be couched in the language of computational complexity. Let W_t be your wealth at time t. Kelly investing grows expected wealth exponentially, something like E[W_t] = o(x^t) for x > 1. It simultaneously shrinks the probability of loss, something like Pr(W_t < T) = o(1/t). (Actually, I have no idea if the decay is linear: just a guess.) I suspect that relaxing the second condition would not lead to much higher expected growth, and perhaps that fractional Kelly offers additional safety without sacrificing too much growth. If formalized, this would be some sort of mixed Bayesian and worst-case argument. The first condition is a standard Bayesian one: maximize expected wealth. The second condition — ensuring that the probability of loss goes to zero — guarantees that even the worst case is not too bad.

Conclusions

Fortune’s Formula is vastly better researched than your typical popsci book: Poundstone extensively cites and quotes academic literature, going so far as to unearth insults and finger-pointing buried in the footnotes of papers. Poundstone clearly understands the math and doesn’t shy away from it. Instead, he presents it in a detailed yet refreshingly accessible way, leveraging fantastic illustrations and analogies. For example, the figure and surrounding discussion on pages 197-201 paint an exceedingly clear picture of how objectives #1 and #2 compare and, moreover, how #1 “wins” in the end. There are other gems in the book, like

  • Kelly’s quote that “gambling and investing differ only by a minus sign” (p.75)
  • Louis Bachelier’s discovery of the efficient market hypothesis in 1900, a development that almost no one noticed until after his death (p.120)
  • Poundstone’s assertion that “economists do not generally pay much attention to non-economists” (p.211). The assertion rings true, though to be fair it applies to most fields, and I know many glaring exceptions.
  • The story of the 1998 collapse of Long-Term Capital Management and ensuing bailout is sadly amusing to read today (p.290). The factors are nearly identical to those leading to the econalypse of 2008: leverage + correlation + too big to fail. (Poundstone’s book was published in 2005.) Will we ever learn? (No.)

Fortune’s Formula is a fast, fun, fascinating, and instructive read. I highly recommend it.

__________
* See my bookmarks for other reviews of the book and some related research articles.

Notes from Yahoo! Open Hack Day NYC

Here are my notes from Yahoo! Open Hack Day NYC. For other perspectives, read New York Times open sourcerer Nick Thuesen’s write-up or the Yahoo! developer blog. You can watch videos of some of the talks or browse pictures.

First off, I cheated. I went to sleep in a hotel room rather than hack all through the night. (Even in college I woke up at 4am rather than pull an all nighter.) Still, I made decent progress on some pet projects including combinatorial betting. Daniel, Sharad, and Winter from Yahoo! Research New York participated for real, working through the night. Returning in the morning showered and caffeinated to greet the sleepwalkers was a little surreal. A number of ex-Yahoos joined the festivities including David Yang, Mor Naaman, and Chad Dickerson. (Havi joked that Yahoo! is like finishing school for entrepreneurs. If you count Yahoo! capture and releases like Mark Cuban and Paul Graham, the spreading influence is enormous.)

Clay Shirky kicked off the event. He’s a fantastic speaker — watch his talk here. His punch line — that successful communities like facebook, twitter, flickr, and wikipedia start small and cohesive (as opposed to large and fragmented: see Yahoo! 360) — was aimed perfectly at the many founders and foundreamers in the audience. There were speakers from Mint and foursquare and tutorials on the Yahoo! Application Platform, Yahoo! Query Language (the most popular service), Yahoo! TV widgets, and more. There was a round of Ignite NYC, a barrage of twenty-slides-in-five-minutes talks, some educational (geek’s guide to patents), some charitable (aid to South America), some hilarious (spaceman from outer space), some thought-provoking (makerbot 3d printers), and many all of the above (meta mechanical turk; the Emoji translation of Moby Dick). Watch the Ignite talks here.

A bunch of small touches made the event memorable, including a steampunk-themed hacking hall complete with retro red Victorian couches, portraits of hackers through history, funky tweet-streaming sculptures, chalk drawings of old patents, power cords dangling from hanging bird cages, and a Guitar Hero and foosball corner. The food was tasty and at times eccentric, like the hot dog stand and toppings bar under a rainbow umbrella, ice cream cart, and old-fashioned popcorn machine. There was plenty of beer, coffee, red bull, sliders, and cookies, and even (gasp) vegan fare, salmon, and salad.

I give the event an A for style (decor, food) and content (talks, hacks, organization). The one sour note was the wireless — certainly a key ingredient for a good hack day — which began flaky and ended slow but acceptable.

I attended the YAP tutorial and created a rudimentary application. I was pleasantly surprised how simple the process was — the documentation and sample code are great. You can get the hello world app (complete with social hooks) running and add some ajax magic within minutes.

By far one of the coolest sights was the MakerBot Industries 3D printer in action. It sucks in plastic wire, melts it, and deposits it in perfect formation to produce coins, busts, parts for itself, or almost anything in the thingiverse. For Hack Day, the device printed news headlines in peanut butter on toast. We met a member of NYC Resistor who was working on a conveyor belt mechanism for his own MakerBot printer, and he invited us to craft night at their shared hackspace in Brooklyn (a place that would be heaven for my dad and brother; Sharad, Jake, Daniel, and Bethany went to check it out).

I missed the tutorial on Yahoo! TV widgets but I’d like to learn more. They are now in most major TV brands including Sony, Samsung, and LG — reaching millions of sets around the world in the coming months. (The Sony won editor’s choice in the Sept 2009 issue of Wired magazine; the Samsung and LG rated close behind. The sole TV reviewed without Yahoo! Widgets, a Panasonic, was ridiculed for its clunky Viera Cast online interface.) If you’re an internet video startup, like my friend, you need a widget channel. Personally, I’d love to see a sports game tracker that highlights pivotal moments by monitoring in-game betting odds.

Footnote: Two Yahoos made a humorous video (that’s both self-promotional and -deprecating) on what people in Times Square think ‘hacker’ means:

See Paul Tarjan and Christian Heilmann for real definitions.

Yahoo! Open Hack Day NYC, Oct 9-10, 2009

Join us on October 9, 2009 at the Millennium Broadway Hotel in New York City for Yahoo! Open Hack Day NYC. Come to listen, learn, and meet, but mainly come to make. Your goal: in 24 hours hack/mash something together for bragging rights and prizes. Speakers include Clay Shirky (NYU), Carrie Cronkey (Mint.com), Dennis Crowley (foursquare), and Rasmus Lerdorf (inventor of PHP). Register here. It’s free.

The 24-Hour Hackathon begins Friday afternoon. We encourage you to play around with Yahoo!’s Open Platforms and APIs like YAP, YQL, YUI, TVWidgets, our Social APIs, and more. And of course, feel free to use other APIs, developer tools and whatever software/hardware floats your boat…

At the end of the 24 hours, the hackers will have the chance to debut their hack and winners will be awarded with some enviable prizes…

And of course we will keep you well fed and hydrated throughout the two days. There will also be sleeping areas in case you want to take a nap.

Previous: the what and why of Open Hack.

Upcoming CS-econ events: New York Computer Science and Economics Day and ACM Conference on Electronic Commerce

1. New York Computer Science and Economics Day (NYCE Day)

Monday, November 9, 2009 | 9:00 AM – 5:00 PM
The New York Academy of Sciences, New York, NY, USA

NYCE 2009 is the Second Annual New York Computer Science and Economics Day. The goal of the meeting is to bring together researchers in the larger New York metropolitan area with interests in Computer Science, Economics, Marketing and Business and a common focus in understanding and developing the economics of internet activity. Examples of topics of interest include theoretical, modeling, algorithmic and empirical work on advertising and marketing based on search, user-generated content, or social networks, and other means of monetizing the internet.

The workshop is soliciting rump session speakers until October 12. Rump session speakers will have 5 minutes to describe a problem and result, an experiment/system and results, or an open problem or a big challenge.

Invited Speakers

  • Larry Blume, Cornell University
  • Shahar Dobzinski, Cornell University
  • Michael Kearns, University of Pennsylvania
  • Jennifer Rexford, Princeton University

CFP: New York Computer Science and Economics Day (NYCE Day), Nov 9 2009

2. 11th ACM Conference on Electronic Commerce (EC’10)

June 7-11, 2010
Harvard University, Cambridge, MA, USA

Since 1999 the ACM Special Interest Group on Electronic Commerce (SIGecom) has sponsored the leading scientific conference on advances in theory, systems, and applications for electronic commerce. The Eleventh ACM Conference on Electronic Commerce (EC’10) will feature invited speakers, paper presentations, workshops, and tutorials covering all areas of electronic commerce. The natural focus of the conference is on computer science issues, but the conference is interdisciplinary in nature. The conference is soliciting full papers and workshop and tutorial proposals on all aspects of electronic commerce.

The key to understanding net neutrality: Anonymity=good, egalitarianism=bad

For a long time I was terribly confused and conflicted about net neutrality (and embarrassed about being uncommitted on such a core issue in my industry). On the one hand, paying more for higher quality of service is only natural and leads to better provisioning of resources and less waste. HD movie watchers can pay for low latency streaming while email users need not. Treating their packets the same is madness; legislating that they must be treated the same is even worse. On the other hand, many people I respect, including economically literate ones, vociferously argue for net neutrality. And Comcast “shaping” Skype traffic scores an 88 on the Ticketmaster scale of evil.

The key to understanding this debate is recognizing the difference between anonymity and egalitarianism. A mechanism is anonymous if the outcome does not depend on the identity of the players: two players who bid the same are treated equally. It doesn’t matter what their name, age, or wealth is, what company they represent, or how they plan to use the item — all that matters is what they bid. This is a good property for almost any public marketplace that ensures fair treatment, and one worth fighting for on the Internet. AppleT&T should not block Google Voice just because it’s a threat. In fact, even without legislation, it’s almost impossible to bar anonymous participation on the Internet. Service providers can, if forced to, encrypt their packets and hide their content, origin, and purpose, making them indistinguishable from others.

However, no one would argue that everyone in a marketplace should receive identical resources. Players who bid more can and must be distinguished (for example, by winning more items) from players who bid less. So, while it’s wrong to discriminate based on identity, it’s absolutely essential to discriminate based on willingness to pay. That is the difference between an egalitarian lottery (silly) and an anonymous marketplace (good).
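To make the distinction concrete, here is a toy sketch (entirely my own, not a claim about how any ISP actually allocates bandwidth): an anonymous mechanism looks only at the bids, while an egalitarian lottery ignores willingness to pay altogether.

    import random

    def anonymous_auction(bids):
        """Anonymous: the outcome depends only on the bids, not on who submitted them.
        Highest bid wins; the winner pays the second-highest bid (a second-price rule)."""
        ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
        winner = ranked[0][0]
        price = ranked[1][1] if len(ranked) > 1 else 0.0
        return winner, price

    def egalitarian_lottery(bids):
        """Egalitarian: everyone gets the same chance regardless of willingness to pay."""
        return random.choice(list(bids)), 0.0

    bids = {"HD movie watcher": 5.00, "email user": 0.10}
    print(anonymous_auction(bids))     # capacity goes to whoever values it most
    print(egalitarian_lottery(bids))   # capacity assigned by coin flip

Swap the two bidders’ names and the auction outcome is unchanged; that is all anonymity requires.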

Somehow the net neutrality debate has confounded these two issues. I agree that any Internet constitution should include the principle that all packets are treated equally regardless of their creator or purpose (charging $30 for “unlimited” data plus 30 cents per 160-character text message scores 72 on the Ticketmaster index). However, users or services who are willing to pay for it can and should receive higher quality. To do otherwise virtually guarantees wasting resources.

Update 2009/08/27: Mark Cuban (as always) says it well. [Via Tom Murphy]

A must read for computer scientists: Lance is right: Time to grow up

Lance Fortnow wrote a terrific op ed in the current issue of Communications of the ACM, arguing that the field of computer science should operate like most other academic disciplines and use journal publications as the primary measure of research contributions, freeing up conferences to serve a community role.

I agree (nearly) completely. The conference publication system is broken. Computer science papers are by and large not scholarly documents: many are sloppily written in deadline-driven haste with poor literature reviews, often blamed on page limits. Many reviews are rushed or cursory and decisions are safe at best, arbitrary at worst. The conference system encourages balkanization and discourages the emergence of a unified computer science conference.

Journals are better, as long as we move forward and not backward. We need open-access journals with fast turnaround times. Lance’s article itself underscores the point: it’s behind a pay wall, albeit a comparatively inexpensive and lenient one — Lance can distribute the near-final pre-print version on his own web page. That’s good but not good enough.

Kamal and Panos also have some refreshing ideas on this subject. Platforms like Yoav Freund’s machine learning forum represent a natural and intelligent evolution of peer review.

Microfunding: the next big small thing?

First microlending, then microblogging, now microfunding.* Announcements of three funds recently sixdegreed their way to my doorstep, each smaller and faster than the last, a trend iconified by the famously speedy and minuscule Y Combinator:

1) George “Greek geek” Tziralis’s openfund; 2) Kevin Dick’s Black Swan Fund [via Daniel Horowitz]; and 3) the Awesome Foundation [via Foo Camp list].

The beginning of a trend?

Update 2010/03/19: The Black Swan fund, now called RightSide Capital, is open for business.

___________

*Yet still no micropayments! 🙁

Psst: WeatherBill doesn’t know New Jersey is the new Florida: Place your bets now

Quantifying New York’s 2009 June gloom using WeatherBill and Wolfram|Alpha

In the northeastern United States, scars are slowly healing from a miserably rainy June — torturous, according to the New York Times. Status updates bemoaned “where’s the sun?”, “worst storm ever!”, “worst June ever!”. Torrential downpours came and went with Florida-like speed, turning gloom into doom: “here comes global warming”.

But how extreme was the month, really? Was our widespread misery justified quantitatively, or were we caught in our own self-indulgent Chris Harrisonism, “the most dramatic rose ceremony EVER!”.

This graphic shows that, as of June 20th, New York City was on track for near-record rainfall in inches. But that graphic, while pretty, is pretty static, and most people I heard complained about the number of days, not the volume of rain.

I wondered if I could use online tools to determine whether the number of rainy days in June was truly historic. My first thought was to try Wolfram|Alpha, a great excuse to play with the new math engine.

Wolfram|Alpha queries for “rain New Jersey June 200Y” are detailed and fascinating, showing temps, rain, cloud cover, humidity, and more, complete with graphs (hint: click “More”). But they don’t seem to directly answer how many days it rained at least some amount. The answer is displayed graphically but not numerically (the percentage and days of rain listed appear to be hours of rain divided by 24). Also, I didn’t see how to query multiple years at a time. So, in order to test whether 2009 was a record year, I would have to submit a separate query for each year (or bypass the web interface and use Mathematica directly). Still, Wolfram|Alpha does confirm that it rained 3.8 times as many hours in June 2009 as in June 2008, itself already one of the wetter months on record.

WeatherBill, an endlessly configurable weather insurance service, more directly provided what I was looking for on one page. I asked for a price quote for a contract paying me $100 for every day it rains at least 0.1 inches in Newark, NJ during June 2010. It instantly spat back a price: $694.17.



WeatherBill rainy day contract for June 2010 in Newark, NJ

It also reported how much the contract would have paid — the number of rainy days times $100 — every year from 1979 to 2008, on average $620 for 6.2 days. It said I could “expect” (meaning one standard deviation, or 68% confidence interval) between 3.9 and 8.5 days of rain in a typical year. (The difference between the average and the price is further confirmation that WeatherBill charges a 10% premium.)

Below is a plot of June rainy days in Newark, NJ from 1979 to 2009. (WeatherBill doesn’t yet report June 2009 data so I entered 12 as a conservative estimate based on info from Weather Underground.)


Number of rainy days in Newark, NJ from 1979-2009

Indeed, our gloominess was justified: it rained in Newark more days in June 2009 than any other June dating back to 1979.

Intriguingly, our doominess may have been justified too. You don’t have to be a chartist to see an upward trend in rainy days over the past decade.

WeatherBill seems to assume as a baseline that past years are independent unbiased estimates of future years — usually not a bad assumption when it comes to weather. Still, if you believe the trend of increasing rain is real, either due to global warming or something else, WeatherBill offers a temptingly good bet. At $694.17, the contract (paying $100 per rainy day) would have earned a profit in 7 of the last 7 years. The chance of that streak being a coincidence is less than 1%.
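For what it’s worth, here is the back-of-the-envelope arithmetic behind that last claim (my own, under the strong simplifying assumption that each year independently had no better than a 50-50 chance of the payout beating the price):

    p_streak = 0.5 ** 7
    print(f"chance of profiting 7 years out of 7 by luck alone: {p_streak:.4f}")  # 0.0078

    # Implied markup over the 1979-2008 average payout of $620:
    print(f"markup: {694.17 / 620 - 1:.0%}")  # about 12%, in line with the premium noted above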

If anyone places this bet, let me know. I would love to, but as of now I’m roughly $10 million in net worth short of qualifying as a WeatherBill trader.

Meet the splORGers: The latest breed of web spam parasites

Via Muthu. This is mind boggling to me.

Sparasites on the web now somehow find it worth their while to invade ultra-specialized academic conferences. Call them splORGers. (In close analogy to sploggers).

The website focs2008.org appears to be the official home of the 49th Annual IEEE Symposium on Foundations of Computer Science. (In fact, it’s the top result for the search “focs 2008” in Bing, Google, and Yahoo!.) Historically a few hundred people attend to hear talks like “A Hypercontractive Inequality for Matrix-Valued Functions with Applications to Quantum Computing and LDCs”.

The website appears fully functional: you can browse the entire website structure including internal links like the list of accepted papers and external links like the online registration form.

But look more closely at the lower left corner of the front page. What do you see? SPAM KEYWORDS!: “Data Recovery Dell Memory HP Memory PC RAM wow accounts WoW gold”.

spam keywords on splORG site focs2008.org

WTF??!!

It turns out that focs2008.org is NOT the official FOCS 2008 conference home page. Rather, the official page is http://www.cs.cmu.edu/~FOCS2008/. (Yahoo! ranks this site in second place, Bing and Google in seventh.)

This doesn’t seem like a zero-cost no-brainer automated attack. It involves identifying the appropriate domain name and mirroring another website, not as one-click as it sounds. There’s even a small sign of manual effort: the fox graphic in the upper left links to focs2007.org rather than 2008, as in the original. And of course there’s the cost to register and host the domain.

So why bother? Clearly, the perpetrator is not expecting real people to click on the spam links. At its peak, about as many people searched for “focs 2008” as for “pennock”, and the offending links are fairly obscure. This is most certainly about siphoning link juice from seemingly legitimate .orgs that search engines trust.

But can that benefit really outweigh the cost? Again and again I simply fail to grok the economics of spam.

SplORGers have also set up camp at focs2007.org and ioi2008.org. Curiously, focs2009.org has a more transparent yet still head-scratching disclaimer.

Today, I stumbled onto a similar spamfiltration on mortgagepoints.com, the first external link on the Wikipedia definition of mortgage points, prompting me to finally write this post. Look what our ultra open web has wrought!

Recovering from swine’s infection (my blog, that is)

For the second time, a hacker (in the swine sense of the word) broke in and defaced Oddhead Blog. Once again, I’m left impressed by the ingenuity of web malefactors and entirely mystified as to their motivation.

Last week several readers notified me that my rss feed on Google Reader was filled with spam (“Order Emsam No RxOrder Emsam Overnight DeliveryOrder… BuyBuy…”).

The strange part was, the feed looked fine when accessed directly on my website or via Bloglines. Only when Google requested the feed did it become corrupted, thus mucking up my content inside Google Reader but not on my website.

(Hat tip to Anthony who diagnosed the ailment: calling curl http://blog.oddhead.com/feed/ yielded clean output, while the same request masquerading as coming from Google, curl -A ‘Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=12312313123123)’ http://blog.oddhead.com/feed/, yielded the spammed-up version.)
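For anyone who wants to run the same check on their own feed, here is a rough Python equivalent of Anthony’s curl test (my own sketch, not part of the original diagnosis): fetch the feed once normally and once pretending to be Google’s Feedfetcher, and compare.

    from urllib.request import Request, urlopen

    URL = "http://blog.oddhead.com/feed/"
    GOOGLE_UA = ("Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; "
                 "10 subscribers; feed-id=12312313123123)")

    plain = urlopen(Request(URL)).read()
    spoofed = urlopen(Request(URL, headers={"User-Agent": GOOGLE_UA})).read()
    print("feeds match" if plain == spoofed else "feed differs by user agent -- suspicious")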

In the meantime, Google Search had apparently deduced that my site was compromised and categorized my blog as spam. Look at the difference between these two searches. Nearly every page containing the query terms, no matter how tangential, takes precedence over blog.oddhead.com in the results. [2009/06/23 Update: This is no longer the case: Apparently Google Search has reconsidered my blog.]

So began a lengthy investigation to find and eradicate the invader. The offending text did not appear anywhere in my WordPress code or database. Argg. I found that my plugins directory was world-writeable: uh oh. Then I found a file named remv.php in my themes directory containing a decidedly un-automattic jumble of code. Apparently this is an especially nasty bugger:

I’ve never seen a hack crop up with the tenacity of “remv.php” tho. Seriously, it’s kind of scary.

I’m still not sure how or even if an attacker used remv.php to corrupt my feed in such a subtle way. I decided on surgery by chainsaw rather than scalpel. I exported all my content into a WordPress XML file, deleted my entire installation of WordPress, reinstalled WordPress, then imported my content back in. I restored my theme and re-entered some meta data, but I still have many ongoing repairs to do like importing my blogroll and other links.

The attack was clever: a virus that sickens but does not kill the patient. The disease left my web site functioning perfectly well, making it less likely for me to notice and harder to track down. The bizarre symptom — corrupting the rss feed but only inside Google Reader — led Chris to wonder if the attacker knew I was a Yahoo! loyalist. That seems unlikely. I don’t think I have enemies who care that much. Also, the spammy feed appeared in Technorati as well. Almost surely I was the victim of an indiscriminate robot attack. Still, after searching around, I couldn’t find another example of exactly this form of RSS feed “selective corruption”: has anyone seen or heard of this attack or can find it? And can anyone explain why?

What did I learn? I learned to listen to Chris and not make him mad. 🙂

I also found a bunch of useful WordPress security tips, resources, and plugins that might be useful to others including my future self: