The Behavior of the Yahoo! Bots - Part I

I decided to look at how Yahoo! crawled my site in April, and I found some very interesting things. Some findings were new to me; others were old news.

Methodology

Using the PHPCounter website analytics program, I searched for all hits in April whose remote host contained the word 'yahoo'. A total of 378 hits were found. Note that this does not detect every hit from Yahoo!; indeed, the number of visits by the Yahoo! Slurp bot was much, much higher.

The data came from the log files of eKstreme.com.
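
For anyone who wants to repeat the search without PHPCounter, here is a minimal Python sketch of the same filter. It assumes the hits have already been parsed into dictionaries; the field names `remote_host` and `user_agent`, and the host `host.example.net`, are made up for illustration and are not PHPCounter's actual format.

```python
from collections import Counter

def yahoo_hits(hits, needle="yahoo"):
    """Return the hits whose remote host contains `needle` (case-insensitive)."""
    return [h for h in hits if needle in (h.get("remote_host") or "").lower()]

def count_user_agents(hits):
    """Tally hits per user agent string, most frequent first."""
    return Counter(h.get("user_agent") or "" for h in hits).most_common()

# Hypothetical sample records; only the yahoo.com host names and the
# user agent strings are taken from the post itself.
sample_hits = [
    {"remote_host": "test02.wap.search.scd.yahoo.com",
     "user_agent": "Nokia6600/1.0 (4.09.1) SymbianOS/7.0s Series60/2.0 "
                   "Profile/MIDP-2.0 Configuration/CLDC-1.0"},
    {"remote_host": "morgue2.corp.yahoo.com", "user_agent": "Mozilla/4.05 [en]"},
    {"remote_host": "morgue2.corp.yahoo.com", "user_agent": "Mozilla/4.05 [en]"},
    {"remote_host": "host.example.net", "user_agent": "Mozilla/4.05 [en]"},
]

matched = yahoo_hits(sample_hits)
for agent, count in count_user_agents(matched):
    print(count, agent)
```

A substring match on the host name is deliberately crude: it only catches hits whose reverse DNS name happens to contain 'yahoo', which is the same limitation noted above.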

User Agents

I was surprised by the diversity of user agents that I found. The 'standard' Yahoo! Slurp user agent was the most active bot, and it was identified as:

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

No surprises there really. Yahoo! also have the Yahoo! Seeker bot, and indeed, I got a few hits from it. However, I found two user agents for it:

YahooSeeker/1.2 (compatible; Mozilla 4.0; MSIE 5.5; yahooseeker at yahoo-inc dot com ; http://help.yahoo.com/help/us/shop/merchant/)

DoCoMo/2.0/SO502i (compatible; Mozilla 4.0; MSIE 6.0; yahooseeker-jp-mobile AT Yahoo!JAPAN)

There was only a single hit from the second one, and it was from *.yahoo.co.jp. Localized mobile searching anyone?

Some of the more interesting visits from Yahoo domains had the following user agent:

Scooter/3.3

Scooter was the bot for AltaVista, which got bought by Overture, which Yahoo! acquired. It seems that Scooter is still alive, as evidenced by the occasional visits. Incidentally, Scooter sends the HTTP_REFERER header, and sets it to be exactly the page it is requesting; that is, it pretends the page is referring itself.
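
Scooter's self-referral quirk is easy to spot in a log. The check below is a rough sketch of how I would flag it; the function name and the example.com URLs are hypothetical, and it assumes you already have the requested URL and the referrer as full URL strings.

```python
def is_self_referral(requested_url, referer):
    """True when the Referer header is exactly the URL being requested."""
    return bool(referer) and referer == requested_url

# Hypothetical requests: the first refers to itself, the second does not.
print(is_self_referral("http://www.example.com/page.html",
                       "http://www.example.com/page.html"))
print(is_self_referral("http://www.example.com/page.html",
                       "http://www.example.com/"))
```

An exact string comparison matches the behavior described above ("exactly the page it is requesting"). If your logs store only the path for the request but a full URL for the referrer, you would need to normalize both first.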

I also found a new (to me, at least) user agent:

Slurpy Verifier/1.0

It accessed only a single page from this blog (the .htaccess and WordPress entry) about 10 times. Why this particular interest, I have no idea.

One I expected to see was the Yahoo! Blogs bot. Indeed, I found a few hits from it, but some of them were to non-blog pages. The pattern seems to be that the non-blog pages it requested were all linked to from blog pages. I am not sure how generally accurate this observation is, but it is the pattern for now. The user agent was:

Yahoo-Blogs/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html )

Finally we come to the biggest two surprises: Yahoo user agents that do not identify themselves as Yahoo! bots. The first one is:

Nokia6600/1.0 (4.09.1) SymbianOS/7.0s Series60/2.0 Profile/MIDP-2.0 Configuration/CLDC-1.0

That sounds like a mobile phone (cellular phone for our American friends :) ). Was it a bot or someone using a phone to connect to eKstreme.com via the Yahoo! network? I think it was a bot because:

  • None of the hits had a referring URL.
  • All hits were to the eKstreme.com root index.
  • They all came from an IP address that resolves to test02.wap.search.scd.yahoo.com. The 'test02' bit suggests a testing site.
  • A total of 7 hits were made, all occurring within a 3-second span. (Talk about stressing a website!)
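
The four clues above can be rolled into a simple heuristic. This is only a sketch of my reasoning, not a reliable bot detector: the field names, the `min_hits` threshold, and the timestamps in the sample are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

def looks_like_bot_burst(hits, max_span=timedelta(seconds=3), min_hits=5):
    """Heuristic: many referrer-less hits to one page from one host in a short burst."""
    if len(hits) < min_hits:
        return False
    no_referrer = all(not h["referer"] for h in hits)
    single_path = len({h["path"] for h in hits}) == 1
    single_host = len({h["remote_host"] for h in hits}) == 1
    times = [h["time"] for h in hits]
    within_span = max(times) - min(times) <= max_span
    return no_referrer and single_path and single_host and within_span

# Seven hypothetical hits to the root index, 400 ms apart.
# The date is a placeholder; the real timestamps are not given in this post.
start = datetime(2000, 1, 1, 12, 0, 0)
burst_hits = [
    {"remote_host": "test02.wap.search.scd.yahoo.com", "path": "/",
     "referer": "", "time": start + timedelta(milliseconds=400 * i)}
    for i in range(7)
]

print(looks_like_bot_burst(burst_hits))
```

A human on a phone could in principle produce a referrer-less hit to the root index, which is why the sketch requires all of the conditions together rather than any one of them.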

So what's up with that, Yahoo!?

The second big surprise was a pair of identical hits. Both came from morgue2.corp.yahoo.com, and both had the following user agent:

Mozilla/4.05 [en]

This one is a bit too interesting for my taste: why would anyone call a server 'morgue', let alone have two of them (as in 'morgue2'), and why the very plain user agent? Too many questions.

Just for the sake of completeness, I want to note that I found quite a few 'normal' browser user agents, and they followed behavior expected from human surfers: they had real referring links, were clearly visits to several related pages (like using a tool and following through every page of the analysis), and had expected user agent strings.

In short, there seem to be quite a few Yahoo! bots around, which raises the simple question of 'why?' What does each bot do? Why does Yahoo! need them? Fighting spam might be one reason: Yahoo! may be sending diverse user agents in order to discover cloaking based on user agents. If so, it is not that effective, because every one of these hits was clearly identifiable as coming from a Yahoo! host.
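
To make the cloaking idea concrete, here is a sketch of how a user-agent cloaking check could work in principle. This is my guess at the general technique, not Yahoo!'s actual method. The `fetch` argument is a hypothetical callable that takes a URL and a user agent and returns the page body as bytes; the sample below uses a fake site instead of real network requests.

```python
import hashlib

def detect_ua_cloaking(fetch, url, user_agents):
    """Fetch `url` once per user agent and report whether the bodies differ.

    Returns (differs, digests), where digests maps each user agent to the
    SHA-256 hex digest of the body it received.
    """
    digests = {ua: hashlib.sha256(fetch(url, ua)).hexdigest() for ua in user_agents}
    return len(set(digests.values())) > 1, digests

def fake_cloaking_site(url, user_agent):
    """Hypothetical site that serves crawlers a different page than browsers."""
    if "Slurp" in user_agent:
        return b"<html>keyword-stuffed page for crawlers</html>"
    return b"<html>page for human visitors</html>"

agents = [
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/4.05 [en]",
]
differs, digests = detect_ua_cloaking(fake_cloaking_site, "http://www.example.com/", agents)
print(differs)
```

Comparing hashes is the crudest possible test: any page with dynamic content (dates, ads, counters) would differ between fetches without any cloaking going on. And, as noted above, a site that cloaks on the remote host rather than the user agent would not be fooled by a plain user agent string.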

On to The Behavior of the Yahoo! Bots - Part II
