Good Bots Gone Bad
I’ve been keeping a very close watch on bots hitting eKstreme.com lately and I’ve come up with some interesting observations. Some of them are of great importance to webmasters (like MSNBot’s entries) and others are more of an FYI. In no particular order:
- Feedburner’s bot does not obey robots.txt. Specifically, it ignores this command:
User-agent: *
Disallow: /socializer/?
It sends HTTP 1.1 HEAD (not the usual GET) requests to the Socializer’s bookmarking pages. The remote host is chi-fetch.feedburner.com (66.150.96.121) and the user agent is FeedBurner/1.0 (http://www.FeedBurner.com).
- MyBlogLog still has an empty user agent string. I posted about this on Cre8 detailing how I emailed MBL support and they promised (within minutes) to forward it to an engineer. The remote host is www1.mbl.sp1.yahoo.com (69.147.90.63). If you browse to that, you get the MyBlogLog home page. Why should you care? Because if you block empty UAs as an anti-scraper method, MBL will get blocked too.
- MSNBot’s authentication doesn’t work. Yes, MSN’s bot authentication is BROKEN for these IP addresses: 65.54.165.43, 65.55.235.216, 65.54.165.65, and 65.55.233.40. A lot of crawling activity occurs from these addresses, but they resolve to *.phx.gbl (what is that, anyway, Microsoft?!) not to the expected *.search.live.com. Because of this, any crawls from these IP addresses do not authenticate and so are blocked here on eKstreme.com and blogSci.com.
- Where is Yahoo! Slurp’s bot authentication? It was promised way back at the end of March, but I’m still seeing about half of the Slurp requests from *.inktomisearch.com in addition to the promised *.crawl.yahoo.net that allows for authentication.
- Tailrank’s bot does not obey robots.txt. I emailed them a while back and they promised the next (then-imminent) update would fix that, but nope, not yet. Tailrank still rummages through eKstreme.com without regard to how it should behave.
- Techmeme’s bot does not identify itself. The remote host comes up as techmeme.com (75.126.195.146) and the user agent is Mozilla/5.0 (compatible; Wazzup1.0.7613; http://70.86.131.10/Wazzup).
- There is a lot of crawling activity from *.amazonaws.com. A quick background: Amazon is developing a whole great set of platform services called Amazon Web Services (AWS). I recommend everyone read up on them, especially EC2 and S3. The AWS crawling activity is mostly web startups looking to index the web or feeds – so fine. What I’m seeing, though, is more and more evidence of scrapers running off AWS, which is not good. How to deal with this is a tough one: should we block everything from *.amazonaws.com, or should Amazon identify the account holder in the remote host name (like accountname.amazonaws.com)? The latter would strongly discourage scrapers and help in authenticating bots. If the scrapers get more frequent, everything on AWS will be blocked, including the good bots. Amazon needs to act now before this becomes a problem that affects all their customers.
- More on MSNBot: sometimes it doesn’t obey the robots.txt file. Hits from 65.55.208.139 are ignoring this command:
User-agent: *
Disallow: /dev
That’s only started in the last few days. Weird.
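For anyone who wants to audit their own logs for this kind of robots.txt misbehaviour, here is a minimal sketch in Python using the standard library's urllib.robotparser. The rule is the /dev one quoted in this post; the function name, the example.com URLs and the shape of the hits list are assumptions made up for illustration, and this is not necessarily how the checks behind this post were done.

```python
from urllib.robotparser import RobotFileParser

# The /dev rule quoted in the post. Everything else in this sketch
# (function name, sample URLs) is hypothetical.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dev
"""


def find_violations(hits, robots_txt=ROBOTS_TXT):
    """Return the (user_agent, url) pairs that robots.txt does not allow.

    `hits` is any iterable of (user_agent, url) tuples, for example
    values pulled out of an access log.
    """
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines,
    # so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return [(ua, url) for ua, url in hits if not parser.can_fetch(ua, url)]
```

Note that Disallow rules are prefix matches, so in this sketch /dev also covers paths such as /developer. Any hit the function returns is a request a well-behaved bot should never have made.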
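Since "bot authentication" comes up several times in this post, here is a rough sketch of the usual idea: do a reverse DNS lookup on the requesting IP, check that the host name ends in a domain the search engine has published, then do a forward lookup and confirm it points back to the same IP. The suffix list below only contains the two domains mentioned in this post and should be treated as an assumption; the function names are hypothetical too.

```python
import socket

# Suffixes taken from the post; an assumption, not an authoritative list.
TRUSTED_SUFFIXES = (".search.live.com", ".crawl.yahoo.net")


def reverse_lookup(ip):
    """Return the host name the IP address resolves to (PTR record)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Return the list of IPv4 addresses the host name resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_bot(ip, suffixes=TRUSTED_SUFFIXES,
                    reverse=reverse_lookup, forward=forward_lookup):
    """Reverse-then-forward DNS check for a crawler's IP address."""
    try:
        hostname = reverse(ip).lower().rstrip(".")
    except OSError:
        # No reverse DNS entry: cannot authenticate.
        return False
    if not hostname.endswith(tuple(suffixes)):
        # Resolves to something else, e.g. *.phx.gbl.
        return False
    try:
        # The forward lookup must lead back to the original IP,
        # otherwise anyone controlling their own PTR record could pose as the bot.
        return ip in forward(hostname)
    except OSError:
        return False
```

The resolvers are passed in as parameters so the check can be tried out with fake lookups instead of live DNS. With a check like this, an IP that resolves to *.phx.gbl fails at the suffix step, which is exactly why those MSNBot crawls do not authenticate.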
I’m sure there is more I’ve missed, but this will do for now. I’ll leave you with a parting thought perhaps hinting at the future of search at MSN/Live: the HTTP_ACCEPT header MSNBot sends with all requests is this:
This is very interesting because:
- application/* is all applications and binary files, including ZIP files. We talked about how Google is indexing (badly) binary data and how that’s showing up in the SERPs. What are MSN/Live’s plans there, I wonder? It could simply be to index PDF files, so the question is: can MSNBot ‘see inside’ ZIP files?
- Model/vnd.dwf and drawing/x-dwf: Very interesting. Model/vnd.dwf is Autodesk Design Web Format and so is drawing/x-dwf, which, as far as I can understand, are text-based representations of designs for web delivery. Will we start seeing AutoCAD designs in Live Image Search soon? As I like to say, this is "fertile ground for speculation".


