How to Crawl the Web like Slurp, How to Block Such Scrapers, and Explaining Weird Slurp Requests

Yet another bot-related post. The subject this time is Yahoo!, specifically Yahoo!'s crawler Slurp and the Babelfish translation service. The topic is weak SE bot authentication, specifically, the ability to scrape content from sites that use a weak form of Slurp authentication. Also, while researching this, I came up with an explanation for weird hits coming from Yahoo's IP addresses.

The search engines have all introduced a mechanism to authenticate their bots. Yahoo! said back in June that all Slurp hits would come from *.crawl.yahoo.net, but my recent observations show that's not the case. Still, it doesn't matter for scraping purposes, as there is a way to mimic Slurp's behavior almost 100%, perhaps closely enough to get away with it.

To see it in action, use Firefox and the User Agent Switcher extension to set your UA to Slurp's, namely "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)". Next, go to Babelfish, type your website's URL, choose a language pair, and translate the page. The bottom frame is the one we're interested in, so right-click on it and choose "This Frame->Show only this frame". Then check your log files.

In eKstreme.com's case, translated into French, the frame's URL is this:

http://74.6.146.244/babelfish/translate_url_content?.intl=us&lp=en_fr&trurl=http%3A%2F%2Fekstreme.com

And the interesting hit details:

Remote host: proxy3.search.scd.yahoo.net (66.94.237.142)

UA: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) (via babelfish.yahoo.com)

Notice how close this is to a real Slurp hit: it has the word 'Slurp' in the user agent and comes from yahoo.net. So any (weak) Slurp authentication that checks simply for Slurp and yahoo.net will be fooled. However, there are two key differences that allow for proper authentication:

  • The hit is not from *.crawl.yahoo.net, but from *.scd.yahoo.net.
  • The user agent has "(via babelfish.yahoo.com)" appended.

So in short: if you really want to authenticate Slurp, really do check for *.crawl.yahoo.net in the remote host, not simply yahoo.net. Incidentally, that's the advice Yahoo! gives, so follow it!
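To make the difference between the weak and the strict check concrete, here is a minimal Python sketch. The function names are mine, and it assumes your server has already resolved the remote host name through a reverse DNS lookup (ideally confirmed with a forward lookup, which is not shown here). The host name "example.crawl.yahoo.net" in the usage lines is made up for illustration.

```python
SLURP_UA = ("Mozilla/5.0 (compatible; Yahoo! Slurp; "
            "http://help.yahoo.com/help/us/ysearch/slurp)")
BABELFISH_MARK = "(via babelfish.yahoo.com)"


def weak_slurp_check(user_agent: str, remote_host: str) -> bool:
    """The weak test: 'Slurp' in the UA and a host ending in yahoo.net."""
    host = remote_host.lower().rstrip(".")
    return "slurp" in user_agent.lower() and host.endswith("yahoo.net")


def strict_slurp_check(user_agent: str, remote_host: str) -> bool:
    """The strict test: Slurp UA, no Babelfish marker, host in *.crawl.yahoo.net."""
    host = remote_host.lower().rstrip(".")
    ua = user_agent.lower()
    claims_slurp = "yahoo! slurp" in ua
    via_babelfish = BABELFISH_MARK in ua
    # The leading dot matters: it requires a real subdomain of crawl.yahoo.net.
    from_crawl_host = host.endswith(".crawl.yahoo.net")
    return claims_slurp and from_crawl_host and not via_babelfish


# The Babelfish GET hit described above fools the weak check only.
babelfish_ua = SLURP_UA + " " + BABELFISH_MARK
babelfish_host = "proxy3.search.scd.yahoo.net"
print(weak_slurp_check(babelfish_ua, babelfish_host))
print(strict_slurp_check(babelfish_ua, babelfish_host))
# A hypothetical crawler host name passes the strict check.
print(strict_slurp_check(SLURP_UA, "example.crawl.yahoo.net"))
```

The strict check rejects the Babelfish hit on both counts (host and UA), so either difference alone would be enough to catch it.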

There is more to this story. When Babelfish translates a page, it requests the URL to be translated twice. The first request is an HTTP HEAD, and all being well (no 404 error, for example), the page is then properly requested using HTTP GET, which is what I described above. If the HEAD request hits an error, the GET request is not sent.

The HEAD request is very interesting, as it sets the UA to be identical to that of the browser requesting the translation, without adding "(via babelfish.yahoo.com)". So what shows up in the log files is an arbitrary UA coming from *.yahoo.net. Which UAs can you see? Anything out there: I've seen bots like Sogu and AdSense Mediabot, as well as IE and Firefox. If you spoof your browser to be Slurp as described above, you'll get a Yahoo! Slurp request coming from *.scd.yahoo.net, not from *.crawl.yahoo.net, meaning that proper Slurp authentication will fail.
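If you want to label these hits when reading your logs, the rules above can be sketched as a small classifier. This is only a heuristic based on what I observed: the function name and the labels are my own, and a HEAD request from a non-crawl *.yahoo.net host is only probably a Babelfish pre-check, not certainly one.

```python
def classify_yahoo_hit(method: str, user_agent: str, remote_host: str) -> str:
    """Label a log entry using the host name, UA, and HTTP method (heuristic)."""
    host = remote_host.lower().rstrip(".")
    ua = user_agent.lower()
    if not host.endswith(".yahoo.net"):
        return "not-yahoo"
    if host.endswith(".crawl.yahoo.net"):
        # Only these hosts are supposed to carry genuine Slurp traffic.
        return "slurp" if "yahoo! slurp" in ua else "crawl-host-other-ua"
    if "(via babelfish.yahoo.com)" in ua:
        # The GET that actually fetches the page for translation.
        return "babelfish-get"
    if method.upper() == "HEAD":
        # The pre-check that passes the visitor's own UA through unchanged.
        return "probable-babelfish-head"
    return "yahoo-other"


slurp_ua = ("Mozilla/5.0 (compatible; Yahoo! Slurp; "
            "http://help.yahoo.com/help/us/ysearch/slurp)")
proxy = "proxy3.search.scd.yahoo.net"

# The two requests Babelfish makes when the browser is spoofed as Slurp.
print(classify_yahoo_hit("HEAD", slurp_ua, proxy))
print(classify_yahoo_hit("GET", slurp_ua + " (via babelfish.yahoo.com)", proxy))
```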

This, I believe, explains a question someone posted a year ago at Webmaster World. Following on from a different thread, Yahoo_Mike explained what was going on but didn't explain the details of the double-requests.

So in summary, three things:

  • Double check how you authenticate Slurp and make sure you're doing it properly to avoid scrapers.
  • We now have an explanation of the weird behavior seen from *.yahoo.net proxies with details about exactly what's going on.
  • Why doesn't Babelfish identify itself with the HEAD requests? Come on Yahoo!, you can do better!
