Good Bots Gone Bad

I've been keeping a very close watch on bots hitting eKstreme.com lately and I've come up with some interesting observations. Some of them are of great importance to webmasters (like MSNBot's entries) and others are more just FYI. In no particular order:

  • Feedburner's bot does not obey robots.txt, specifically this rule:
    User-agent: *
    Disallow: /socializer/?
    It sends HTTP/1.1 HEAD requests (not the usual GET) to the Socializer's bookmarking pages. The remote host is chi-fetch.feedburner.com (66.150.96.121) and the user agent is FeedBurner/1.0 (http://www.FeedBurner.com).
  • MyBlogLog still has an empty user agent string. I posted about this on Cre8 detailing how I emailed MBL support and they promised (within minutes) to forward it to an engineer. The remote host is www1.mbl.sp1.yahoo.com (69.147.90.63). If you browse to that, you get the MyBlogLog home page. Why should you care? Because if you block empty UAs as an anti-scraper method, MBL will get blocked too.
  • MSNBot's authentication doesn't work. Yes, MSN's bot authentication is BROKEN for these IP addresses: 65.54.165.43, 65.55.235.216, 65.54.165.65, and 65.55.233.40. A lot of crawling activity occurs from these addresses, but they resolve to *.phx.gbl (what is that, anyway, Microsoft?!) not to the expected *.search.live.com. Because of this, any crawls from these IP addresses do not authenticate and so are blocked here on eKstreme.com and blogSci.com.
  • Where is Yahoo! Slurp's bot authentication? It was promised way back at the end of March, but I'm still seeing about half of the Slurp requests from *.inktomisearch.com in addition to the promised *.crawl.yahoo.net that allows for authentication.
  • Tailrank's bot does not obey robots.txt. I emailed them a while back and they promised the next (then-imminent) update would fix that, but nope, not yet. Tailrank still rummages through eKstreme.com without regard to how it should behave.
  • Techmeme's bot does not identify itself. The remote host comes up as techmeme.com (75.126.195.146) and the user agent is Mozilla/5.0 (compatible; Wazzup1.0.7613; http://70.86.131.10/Wazzup).
  • There is a lot of crawling activity from *.amazonaws.com. A quick background: Amazon is developing a whole great set of platform services called Amazon Web Services (AWS). I recommend everyone read up on them, especially EC2 and S3. The AWS crawling activity is mostly web startups looking to index the web or feeds, so fine. What I'm seeing, though, is more and more evidence of scrapers running off AWS, which is not good. How to deal with this is a tough one: should we block everything from *.amazonaws.com, or should Amazon identify the account holder in the remote host (like accountname.amazonaws.com)? The latter would strongly discourage scrapers and help in authenticating bots. If the scrapers get more frequent, everything on AWS will be blocked, including the good bots. Amazon needs to act now before this becomes a problem that affects all their customers.
  • More on MSNBot: sometimes it doesn't obey the robots.txt file. Hits from 65.55.208.139 are ignoring this rule:
    User-agent: *
    Disallow: /dev
    That's only started in the last few days. Weird.
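
For readers wondering what "authentication" means in the MSNBot and Slurp items above, here is a minimal Python sketch. It assumes the usual reverse-then-forward DNS check: look up the hostname for the requesting IP, confirm it ends in an expected suffix, then resolve that hostname and confirm it maps back to the same IP. This is not necessarily the exact code running on eKstreme.com. The function name and the injectable resolver parameters are hypothetical and exist only to make the sketch testable.

```python
import socket

# Suffixes taken from the post: the hostnames genuine MSNBot and Slurp
# requests are expected to resolve to.
TRUSTED_SUFFIXES = (".search.live.com", ".crawl.yahoo.net")


def _reverse_dns(ip):
    """Return the PTR hostname for an IP, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward_dns(hostname):
    """Return the IPv4 addresses a hostname resolves to (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_authenticated_bot(ip, suffixes=TRUSTED_SUFFIXES,
                         reverse=_reverse_dns, forward=_forward_dns):
    """Hypothetical helper: True only if the IP passes both DNS checks."""
    host = reverse(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()
    # Step 1: the reverse lookup must land in a trusted domain.
    if not host.endswith(tuple(suffixes)):
        return False
    # Step 2: the hostname must resolve back to the same IP; otherwise
    # anyone controlling their own reverse DNS could fake the hostname.
    return ip in forward(host)
```

Calling is_authenticated_bot("65.54.165.43") performs live DNS lookups. An address whose reverse lookup ends in .phx.gbl fails step 1, which is the failure described in the MSNBot item.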

I'm sure there is more I've missed, but this will do for now. I'll leave you with a parting thought perhaps hinting at the future of search at MSN/Live: the HTTP_ACCEPT header MSNBot sends with all requests is this:

text/html, text/plain, text/xml, application/*, Model/vnd.dwf, drawing/x-dwf

This is very interesting because:

  • application/* is all applications and binary files, including ZIP files. We talked about how Google is indexing (badly) binary data and how that's showing up in the SERPs. What are MSN/Live's plans there, I wonder? It could simply be to index PDF files, so the question is: can MSNBot 'see inside' ZIP files?
  • Model/vnd.dwf and drawing/x-dwf: Very interesting. Model/vnd.dwf is Autodesk Design Web Format and so is drawing/x-dwf, which, as far as I can understand, are text-based representations of designs for web delivery. Will we start seeing AutoCAD designs in Live Image Search soon? As I like to say, this is "fertile ground for speculation" ;)
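
To make the wildcard point concrete, here is a small Python sketch of how a server might test a content type against that Accept header. The function name is hypothetical and the matching is deliberately simplified: it ignores q-values and media-type parameters. It says nothing about what MSNBot actually does with the files it fetches.

```python
# The Accept header quoted in the post.
MSNBOT_ACCEPT = ("text/html, text/plain, text/xml, application/*, "
                 "Model/vnd.dwf, drawing/x-dwf")


def accepts(accept_header, mime_type):
    """Return True if mime_type matches any media range in accept_header.

    Simplified: media types compare case-insensitively, '*' matches any
    type or subtype, and parameters such as ';q=0.5' are ignored.
    """
    want_type, _, want_sub = mime_type.lower().partition("/")
    for item in accept_header.split(","):
        media_range = item.split(";")[0].strip().lower()
        if not media_range:
            continue
        range_type, _, range_sub = media_range.partition("/")
        if range_type in ("*", want_type) and range_sub in ("*", want_sub):
            return True
    return False
```

Under this reading, accepts(MSNBOT_ACCEPT, "application/zip") and accepts(MSNBOT_ACCEPT, "application/pdf") are both true because of application/*, while accepts(MSNBOT_ACCEPT, "image/png") is false.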

2 Responses to “Good Bots Gone Bad”

  1. johnmu Says:

    Good stuff, Pierre. The information on AWS is particularly interesting - they make it easy and inexpensive to use their great infrastructure; perhaps easy enough for the scrapers to profit as well. How can or will they police their users?

    Also, something I’ve been wanting to check but never got around to: have you seen anything special with Google’s “crawling caching proxy”? Do you see multiple requests folded into a single access (e.g. a web-search crawl that is also re-used for Adsense)? It would be interesting to see how it would react to a differentiated robots.txt (sections blocked for the Adsense-bot but allowed for the web search bot).

  2. Pierre Says:

    Hi John

    The AWS issue really is a spanner in the works. I suggested personally identifying the account holder, which will put the company name in the remote host. This enables authentication along the lines of GBot and MSNBot. Also, if the remote host is tied to a single person, then the required anonymity for scraping is removed.

    Not sure what you mean about the crawling caching proxy thing. Drop me an email please and I’ll take a closer look.

    Pierre
