Category Archives: spam

Oddhead Blog hacked… for the third time

May 6, 2012 David Pennock 14 Comments

My blog has been hacked yet again. For those keeping track, that’s infection number three. This latest exploit is very similar to the previous one. To humans arriving via browser (e.g., me), the site appears perfectly normal and healthy. Even upon clicking ‘view source’, nothing untoward is revealed. The <title> of my blog is, as always, Oddhead Blog.

However, when Google’s or Bing’s crawlers arrive to index my corner of the web, they see a different <title> altogether — Buy Cheap Cialis OnlineÂ — and immediately roll their eyes. (Actually even if you run 'curl http://blog.oddhead.com', you’ll see the spam keywords.) The effect of the attack is a kind of reverse cloaking. Cloaking is the black-hat SEO practice of serving legitimate content to crawlers and spam content to people. Here, the spam content is shown to the crawlers and the legitimate content to the people.

Once the crawlers report thisÂ appallingÂ information back to their respective mother ships, the search engines have no choice but to delist and demote my blog in their pagerankings. Right now, if you search for or within Oddhead Blog on Google, you’ll see how poorly the bots in Mountain View think of me:

Oddhead Blog hacked again: Spam titles in Google's cache 2012-04-27

You can hardly find any deep links into my blog by searching Google. For example, try searching for Bem+Wom, my invented term for “BEtter Mousetrap, Word of Mouth”. Even try “Bem+Wom oddhead blog”. You”ll find aggregators republishing my content, but no links to the original source, my blog, anywhere in sight. (Note to self: the Bing results for Bem+Wom are awful.)

Once again I am at a loss to understand my attacker’s motivation. Clearly it’s not to sell Cialis to my users, as they remain blissfully ignorant of any changes. The only benefit to anyone is to remove one relatively obscure blog from the search engine rankings and thus to move the attacker one slot up. Having a blog tangentially about gambling probably puts me into a shady neighborhood of the web, yet reverse-cloaking your competition (even if it can be somewhat automated and strike more than one competitor) seems like an awfully indirect way to improve one’s standing in Google. It’s also possible this is an act of pure vandalism.

So what should I do? Although I partly blame WordPress for writing insecure software, I may end up paying WordPress protection money to make this problem go away. I am seriously considering giving up on self hosting and moving my whole operation to worpress.com’s hosted service, where presumably security is tighter, or at least it’s not my responsibility any more. My web hosting service, DreamHost, may also be partly to blame, yet I like the company and have been quite happy with them in many respects. Any advice, dear reader? WordPress.com? Blogger? Try again and hope the fourth time is the charm? Should I be looking to ditch DreamHost as well?

challenges, fun, ideas, spam, woblomo

Recaptcha poetry: The barking

March 25, 2010 David Pennock 3 Comments

My first attempt. I’m sure you can do better.

The barking. Reenact task. Arnold understanding a caveated consideration. Snowmen ammunition, shoutouts stumbling. A gridded courtroom, community pumping. Cussions berate

advertising, economics, incentives, spam

Meet the splORGers: The latest breed of web spam parasites

June 24, 2009 David Pennock 10 Comments

Via Muthu. This is mind boggling to me.

Sparasites on the web now somehow find it worth their while to invade ultra-specialized academic conferences. Call them splORGers. (In close analogy to sploggers).

The website focs2008.org appears to be the official home of the 49th Annual IEEE Symposium on Foundations of Computer Science. (In fact, it’s the top result for the search “focs 2008” in Bing, Google, and Yahoo!.) Historically a few hundred people attend to hear talks like “A Hypercontractive Inequality for Matrix-Valued Functions with Applications to Quantum Computing and LDCs”.

The website appears fully functional: you can browse the entire website structure including internal links like the list of accepted papers and external links like the online registration form.

But look more closely at the lower left corner of the front page. What do you see? SPAM KEYWORDS!: “Data Recovery Dell Memory HP Memory PC RAM wow accounts WoW gold”.

spam keywords on splORG site focs2008.org

WTF??!!

It turns out that focs2008.org is NOT the official FOCS 2008 conference home page. Rather, it’s http://www.cs.cmu.edu/~FOCS2008/. (Yahoo! ranks this site in second place, Bing and Google in seventh.)

This doesn’t seem like a zero-cost no-brainer automated attack. It involves identifying the appropriate domain name and mirroring another website, not as one-click as it sounds. There’s even a small sign of manual effort: the fox graphic in the upper left links to focs2007.org rather than 2008, as in the original. And of course there’s the cost to register and host the domain.

So why bother? Clearly, the perpetrator is not expecting real people to click on the spam links. At it’s peak, about as many people searched for “focs 2008” as for “pennock” and the offending links are fairly obscure. This is most certainly about siphoning link juice from seemingly legitimate .orgs that search engines trust.

But can that benefit really outweigh the cost? Again and again I simply fail to grok the economics of spam.

SplORGers have also set up camp at focs2007.org and ioi2008.org. Curiously, focs2009.org has a more transparent yet still head-scratching disclaimer.

Today, I stumbled onto a similar spamfiltration on mortgagepoints.com, the first external link on the Wikipedia definition of mortgage points, prompting me to finally write this post. Look what our ultra open web has wrought!

oddhead blog, spam

Recovering from swine’s infection (my blog, that is)

June 22, 2009 David Pennock 14 Comments

For the second time, a hacker (in the swine sense of the word) broke in and defaced Oddhead Blog. Once again, I’m left impressed by the ingenuity of web malefactors and entirely mystified as to their motivation.

Last week several readers notified me that my rss feed on Google Reader was filled with spam (“Order Emsam No RxOrder Emsam Overnight DeliveryOrder… BuyBuy…”).

The strange part was, the feed looked fine when accessed directly on my website or via Bloglines. Only when Google requested the feed did it become corrupted, thus mucking up my content inside Google Reader but not on my website.

(Hat tip to Anthony who diagnosed the ailment: calling curl http://blog.oddhead.com/feed/ yielded clean output, while the same request masquerading as coming from Google, curl -A ‘Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=12312313123123)’ http://blog.oddhead.com/feed/, yielded the spammed-up version.)

In the meantime, Google Search had apparently deduced that my site was compromised and categorized my blog as spam. Look at the difference between these two searches. Nearly every page containing the query terms, no matter how tangential, takes precedence over blog.oddhead.com in the results. [2009/06/23 Update: This is no longer the case: Apparently Google Search has reconsidered my blog.]

So began a lengthy investigation to find and eradicate the invader. The offending text did not appear anywhere in my WordPress code or database. Argg. I found that my plugins directory was world-writeable: uh oh. Then I found a file named remv.php in my themes directory containing a decidedly un-automattic jumble of code. Apparently this is an especially nasty bugger:

Iâ€™ve never seen a hack crop up with the tenacity of â€œremv.phpâ€ tho. Seriously, itâ€™s kind of scary.

I’m still not sure how or even if an attacker used remv.php to corrupt my feed in such a subtle way. I decided on surgery by chainsaw rather than scalpel. I exported all my content into a WordPress XML file, deleted my entire installation of WordPress, reinstalled WordPress, then imported my content back in. I restored my theme and re-entered some meta data, but I still have many ongoing repairs to do like importing my blogroll and other links.

The attack was clever: a virus that sickens but does not kill the patient. The disease left my web site functioning perfectly well, making it less likely for me to notice and harder to track down. The bizarre symptom — corrupting the rss feed but only inside Google Reader — led Chris to wonder if the attacker knew I was a Yahoo! loyalist. That seems unlikely. I don’t think I have enemies who care that much. Also, the spammy feed appeared in Technorati as well. Almost surely I was the victim of an indiscriminate robot attack. Still, after searching around, I couldn’t find another example of exactly this form of RSS feed “selective corruption”: has anyone seen or heard of this attack or can find it? And can anyone explain why?

What did I learn? I learned to listen to Chris and not make him mad. 🙂

I also found a bunch of useful WordPress security tips, resources, and plugins that might be useful to others including my future self:

WordPress, remv.php and you
3 must apply security tips for WordPress
Hardening WordPress
5 plugins to keep WordPress secure
Anatomy of a WordPress hack (“The kicker? All these sites were on Dreamhost.”)
Did your WordPress site get hacked?
DreamHost: Troubleshooting hacked sites
Dealing with a hacker on DreamHost
Docs on WordPress feeds
AskApache plugin to display all the internal WordPress URL rewrite rules (example use) (I couldn’t discern how to interpret the output)
WordPress exploit scanner plugin (I didn’t use after this question spooked me)
Secure WordPress plugin
AskApache password protect plugin

economics, incentives, oddhead blog, spam

Intelligent blog spam

January 14, 2009 David Pennock 23 Comments

As I alluded to previously, I seem to be getting “intelligent spam” on my blog: comments that pass the re-captcha test and seem on-topic, yet upon further inspection clearly constitute link spam: either the author URI or a link in the comment body is spam.

Here is one of the most clear cases, received on January 9 as a comment to my post on the CFTC’s call for proposals to regulate prediction markets:

Date: Fri, 9 Jan 2009 01:28:01 -0800
From: Matt.Herdy
New comment on your post #71 “A historic MayDay: The US
government’s call for help on regulating prediction markets”
Author : Matt.Herdy
Comment:
Thanks for that post. Iâ€™ll put a note in the post.

1. Itâ€™s nothing new. The CFTC will just formalize the current
status quo.
2. We are prisoner of the CFTC regulations and the US Congressâ€™
distaste of sports â€œgamblingâ€. As for the profitability of prediction
exchanges in that strict environment, I donâ€™t see how you can deny that
HedgeStreet went bankrupt even though it was well funded. Isnâ€™t that a
hard fact?
3. Youâ€™re right, but all â€œpragmatistsâ€ should follow a business
plan and make profits. See point #2. Pragmatists wonâ€™t make miracles.

<a href=”http://www.stretch-marks-help.com/”>Removing stretch marks</a>

At first blush, the comments seems to come from a knowledgeable person: they refer to HedgeStreet, an extremely relevant yet mostly unknown company that’s not mentioned anywhere else in the post or other comments.

It turns out the comments seem intelligent because they are. In fact, they’re copied word for word from Chris Masse’s comments on his own blog.

Chris Masse’s page has a link to my page, so it could have been discovered with a “link:” query to a search engine.

Though now I understand what this spammer did, I remain puzzled exactly how they did it and especially why.

Are these comments being inserted by people, perhaps hired on Mechanical Turk or other underground equivalent? Or are they coming from robots who have either broken re-captcha or the security of my blog? (John suspects a security breach.)
Is it really worth it economically? All links in blog comments are NOFOLLOW links anyway, and disregarded by search engines for ranking purposes, so what is the point? Are they looking for actual humans to click these links?

In any case, it seems an intriguing development in the spam arms race. Are other bloggers getting “intelligent spam”? Does anyone know how it’s done and why?

Update 2010/07: Oh, the irony. I got a number of intelligent seeming comments on this post about SEO, nofollow, economics of spam, etc. that were… promoting spammy links. I left them for humor value though disabled the links.

advertising, commentary, economics, incentives, ratings, spam

The seedy side of Amazon's Mechanical Turk

August 13, 2008 David Pennock 25 Comments

I mostly side with Lukas and Panos on the fantastic potential of Amazon’s Mechanical Turk, a crowdsourcing service specializing in tiny payments for simple tasks that require human brainpower, like labeling images. Within the field of computer science alone, this type of service will revolutionize how empirical research is done in communities from CHI to SIGIR, powering unprecedented speed and scale at low cost (here are two examples). My guess is that the impact will be even larger in the social sciences; already, a number of folks in Yahoo’s Social Dynamics research group have started running studies on mturk. (A side question is how university review boards will react.)

However there is a seedier side to mturk, and I’m of two minds about it. Some people use the service to hire sockpuppets to enter bogus ratings and reviews about their products and engage in other forms of spam. (Actually this appears to violate mturk’s stated policies.)

For example, Samuel Deskin is offering up to ten cents to turkers willing to promote his new personalized start page samfind.

EARN TEN CENTS WITH THE BONUS – EASY MONEY – JUST VOTE FOR US AND COMMENT ABOUT US

EARN FOUR CENTS IF YOU:

1. Set up an anoymous email account likke gmail or yahoo so you can register on #2 anonymously

2. Visit http://thesearchrace.com/signup.php and sign up for an account – using your anonymous email account.

3. Visit http://www.thesearchrace.com/recent.php and vote for:

samfind

By clcking “Pick”

SIX CENTS BONUS:

4. Visit the COMMENTS Page on The Search Race, it is the Button Right Next to “Picks” on this page: http://www.thesearchrace.com/recent.php and

5. Say something awesome about samfind (http://samfind.com) on The Search Race’s Comments page.

Make sure to:

1. Tell us that you Picked us.
2. Copy and Paste the Comment you typed on The Search Race’s Comment page here so we know you wrote it and we will give you the bonus!

In fact, Deskin is currently offering bounties on mturk for a number of different spammy activities to promote his site. On the other hand, what Deskin is doing is not illegal and is arguably not all that different than paying PRWEB to publish his rah-rah press release (Start-up, samfind, Launches Customizable Startpage to Compete with Google, Yahoo & MSN, Los Angeles, California (PRWEB) August 4, 2008). And I have to at least give him credit for offering the money under his own name.

Another type of task on mturk involves taking a piece of text and paraphrasing it so that the words are different but the meaning remains the same. Here is an example:

Paraphrase This Paragraph

Here’s the original paragraph:

You’re probably wondering how to apply a wrinkle filler to your skin. The good news is that it’s easy! There are a number of different products on the market for anti aging skin care. Each one comes with its own special application instructions, which you should always make sure to read and carefully follow. In general, however, most anti aging skin care products are simply applied to the skin and left to soak in.

Requirements:
1. Use the same writing style as much as possible.
2. Vary at least 50% of the words and phrases – but keep the same concepts. Use obviously different sentences! Your paragraph should not be just a copy of the first with a few word replacements.
3. Any keywords listed in bold in the above paragraph must be included in your paraphrase.
4. The above paragraph contains 75 words… yours must contain at least 64 words and not more than 101 words.
5. Write using American English.
6. No obvious spelling or grammar mistakes. Please use a spell-checker before submitting. A free online spell checker can be found at www.spellcheck.net.

If you find it easier to paraphrase sentence-by-sentence, then do that. Please do not enter anything in the textbox other than your written paragraph. Thanks!

I have no direct evidence, but I imagine such a task is used to create splogs (I once found what seems like such a “paraphrasing splog”), ad traps, email spam, or other plagiarized content.

It’s possible that paid spam is hitting my blog (either that or I’m overly paranoid). I’m beginning to receive comments that are almost surely coming from humans, both because they clearly reference the content of the post and because they pass the re-captcha test. However, the author’s URL seems to point to an ad trap. I wonder if these commenters (who are particularly hard to catch — you have to bother to click on the author URL) are paid workers of some crowdsourcing service?

Can and should Amazon try to filter away these kinds dubious uses of Mechanical Turk? Or is it better to have this inevitable form of economic activity out in the open? One could argue that at least systems like mturk impose a tax on pollution and spam, something long argued as an economic force to reduce spam.

My main objection to these activities is the lack of disclosure. Advertisements and press releases are paid for, but everyone knows it, and usually the funding source is known. However, the ratings, reviews, and paraphrased text coming out of mturk masquerade as authentic opinions and original content. I absolutely want mturk to succeed — it’s an innovative service of tremendous value, one of many to come out of Amazon recently — but I believe Amazon is risking a minor PR backlash by allowing these activities to flow through its servers and by profiting from them.

Musings of a computer scientist on predictions, odds, and markets