New Stealth Crawler from Yahoo!

For the past few months, I've been tracking a crawler from Yahoo! that does not identify itself on my science blog. The bot's details are:

Requested page: /science/converting-blood-groups
  • At: 06 May 2008 10:21:05 AM GMT
  • Routed to: /index.php
  • Referred from: http://blogsci.com/science/converting-blood-groups
  • Remote: crawl1.image.srch.kr1.yahoo.com (203.212.174.181)
  • Request: HTTP/1.1 GET
  • Accepting:
    • HTTP: */*
    • Charset:
    • Encoding:
    • Languages:
  • UA:
  • Cookies:

Notice a few interesting details: there is no user-agent string; the HTTP_REFERER header is set to the very page being requested; the request comes from *.yahoo.com rather than the usual *.yahoo.net used by Slurp; and the host name contains both "image" and "srch".
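If you want to look for the same pattern in your own logs, here is a minimal sketch in Python. The Hit record and the function name are made up for illustration, and it assumes your logger already gives you the reverse-DNS host name. It simply tests for the signs listed above: empty UA, a referer equal to the requested page, and a *.yahoo.com host.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One logged request (hypothetical structure, adapt to your own logger)."""
    site_host: str    # host the request was sent to, e.g. "blogsci.com"
    path: str         # requested path
    referer: str      # raw Referer header, "" if absent
    user_agent: str   # raw User-Agent header, "" if absent
    remote_host: str  # reverse-DNS name of the client, "" if unknown


def looks_like_stealth_crawler(hit: Hit) -> bool:
    """True when the hit has no UA, refers to itself, and comes from *.yahoo.com."""
    no_user_agent = hit.user_agent.strip() == ""
    own_url = f"http://{hit.site_host}{hit.path}"
    self_referer = hit.referer.rstrip("/") == own_url.rstrip("/")
    from_yahoo_com = hit.remote_host.lower().endswith(".yahoo.com")
    return no_user_agent and self_referer and from_yahoo_com


example = Hit(
    site_host="blogsci.com",
    path="/science/converting-blood-groups",
    referer="http://blogsci.com/science/converting-blood-groups",
    user_agent="",
    remote_host="crawl1.image.srch.kr1.yahoo.com",
)
print(looks_like_stealth_crawler(example))
```

This is only a log filter, not proof of who is behind the request: the host name should still be confirmed with a forward DNS lookup before you act on it.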

The traffic is very low-level: a few hits a day at most, and on many days just a single hit.

What's really interesting is how laser-targeted it is: it's only requested the same two pages many times since May. The pages are the specific blog post linked to above plus the archive page that contains that post, so it's likely something about that post that's of interest to the bot. And yes, the post contains an image, and the image is the only one in the main content of the archive.

I'll dig deeper when I have a chance. Please let me know in the comments below if you're seeing something similar.

The Ultimate jQuery Development Guide

This has got to be the best jQuery development guide I've seen. It's one of those pages you bookmark or add to your scrapbook for those late night hacking sessions when things go wrong.

What do you do with Unauthenticated Search Engine Bots?

Over at Search Engine Journal, Ann Smarty explains how to switch your UA to Googlebot and browse the web. The technique uses a Firefox extension to change the user agent string to that of Googlebot. Simple and works a treat. Except for...

The problem here is that it is very easy to authenticate Googlebot, Slurp, or MSNBot. The three major search engines give us a double-DNS check (a reverse lookup on the IP address followed by a forward lookup on the resulting host name) to verify whether a request claiming to be one of their crawlers is genuine. The authentication helps us webmasters fight against rogue crawlers (not to mention other things ;) ). So the SEJ article is useful, but the technique is not 100% foolproof, and people pretending to be GBot/Slurp/MSNBot will probably get caught in snares laid by clever webmasters.
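For anyone who hasn't seen the double-DNS check written down, here is a rough sketch in Python (not the PHP I actually run). The function names and the suffix list are assumptions for illustration: the two suffixes are the ones that show up in my own logs, so check each search engine's documentation for the current values. The lookups are passed in as parameters so the logic can be tried without touching the network.

```python
import socket
from typing import Callable, Sequence

# Assumed suffix list for illustration only; verify against each
# search engine's own documentation before relying on it.
TRUSTED_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net")


def reverse_lookup(ip: str) -> str:
    """IP address -> host name (the first DNS trip)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> Sequence[str]:
    """Host name -> list of IPv4 addresses (the second DNS trip)."""
    return socket.gethostbyname_ex(host)[2]


def is_authentic_bot(
    ip: str,
    suffixes: Sequence[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Sequence[str]] = forward_lookup,
) -> bool:
    """True only if the IP resolves to a trusted host that resolves back to the IP."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(tuple(suffixes)):
            return False
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses:
        # a failed lookup means the request is not authenticated.
        return False
```

Checking only the reverse lookup is not enough, because whoever controls an IP block controls its reverse DNS. The forward lookup is what closes the loop.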

This raises an interesting question: If you do authenticate SE bot requests, what do you do with unauthenticated ones?

Personally, I just block all unauthenticated bots. The request is served a blank page without any content. I've found that this stopped *all* (yes, all) unauthenticated bots, but with a proportional rise in stealthier bots (i.e. scrapers pretending to be a browser). No matter, this is an arms race and I'm in it for the long run.

Other people suggest you should feed unauthenticated requests with content that AdSense frowns upon like guns or porn. The idea is that these crawlers are out to get your content for MFA sites and so it's best to get them banned the quick and dirty way.

Others suggest just ignoring them; after all, they'll come back with a different UA anyway, so what's the point? This attitude bothers me because it just means giving up and letting your content get scraped far and wide without any control.

So what do you do with unauthenticated bots and more generally, what do you do with bots?

Stop Competitors from Stalking Your Website Using AdWords

Regular readers will know that I like to gaze at my log files in search of life-changing inspirational moments. Well I have another such gem of an inspiration for you: figuring out if someone is stalking your website using the Google AdWords keyword tool and how to stop them.

When someone goes to the AdWords keyword tool and asks for keywords based on the contents of a web page (the "Website content" option), Google actually requests the page live. This request shows up in the logs and can of course be blocked. The details are:

Referred from: (No referer.)
Remote: 74.125.16.37
Request: HTTP/1.1 GET
UA: Mozilla/5.0 (compatible; Google Keyword Tool; +https://adwords.google.com/select/KeywordToolExternal)

So what to do? Be careful about blocking the IP addresses: they belong to Google, and a blanket block risks stopping legitimate requests from Googlebot, Google's Feedfetcher, and so on. The user agent, however, is a good tell-tale sign and is ripe for blocking.

So: aim... fire!

Fire what though? A simple block? Nah, not much fun that. Knowing full well that only competitors will use that service to check out which keywords your pages might rank for, I would feed the requests dud content. Lorem ipsum anyone? How about random content about keyword theft? Here is an SEO exercise for you: which keywords can you get the AdWords keyword tool to show about your pages? To rephrase: what keywords can you "rank" for in the tool?
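As a rough sketch of the idea in Python (the function names and the decoy text are made up for illustration, and this is not tied to any particular web framework), the check is nothing more than a substring match on the user agent followed by a choice of which body to serve:

```python
# Marker taken from the user agent string shown in the log entry above.
KEYWORD_TOOL_MARKER = "Google Keyword Tool"


def is_keyword_tool_request(user_agent: str) -> bool:
    """True if the UA identifies itself as the AdWords keyword tool."""
    return KEYWORD_TOOL_MARKER.lower() in user_agent.lower()


def choose_body(user_agent: str, real_page: str, decoy_page: str) -> str:
    """Serve the decoy to the keyword tool and the real page to everyone else."""
    return decoy_page if is_keyword_tool_request(user_agent) else real_page


ua = ("Mozilla/5.0 (compatible; Google Keyword Tool; "
      "+https://adwords.google.com/select/KeywordToolExternal)")
print(choose_body(ua, "<p>Real article</p>", "<p>Lorem ipsum dolor sit amet</p>"))
```

Matching on the user agent rather than the IP address keeps Googlebot and Feedfetcher out of the line of fire.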

And don't forget to go back into your logs and see how many times people have stalked you.

What is YahooCacheSystem?

I just started noticing some hits coming from a few *.yahoo.net IP addresses with a user agent of just "YahooCacheSystem" and requesting only the raw RSS XML feed so far. All requests are HTTP/1.0 GET, setting the HTTP_ACCEPT to */*. No other headers are set.

The first hit I saw was on April 27th, and it came from the IP address 216.39.58.78. Back then, that resolved to htproxy3.ops.re4.yahoo.net. Ever since, however, the hits have all come from a different C-block, 209.131.41.*, which resolves variously to htproxyX.ops.sp1.yahoo.net (X is a number like 1 or 2 to give htproxy1.ops.sp1.yahoo.net or htproxy2.ops.sp1.yahoo.net). Even more recently, the IP addresses remained the same, but the hosts they resolve to changed to htproxyX.ops.re4.yahoo.net (again, X is a number to give htproxy1.ops.re4.yahoo.net or htproxy2.ops.re4.yahoo.net).

I post about this bot for one simple reason: the UA is very intriguing and the fact that it's requesting just RSS XML feeds is also interesting. Are we going to see a Yahoo! service or a set of services that deal with just blogs?

TechCrunch reported way back in 2005 about the launch of Yahoo! Blog Search, which both then and now points to what Yahoo! calls News Search, which according to the web page lets you "Search real-time news stories from Yahoo! News and across the web." That's fine and dandy, but it's no blog search per se.

So the YahooCacheSystem bot could represent one of two things:

  • Yahoo! is consolidating its backend infrastructure to deal with RSS-based sites better. So they are building a centralized RSS cache for all their services to use. For webmasters, this means we now have a new analytics data point we can look at.
  • Or... (wait, I need to peer at my crystal ball...) Yahoo! is moving towards building a serious set of services centred around XML feeds. This could mean we could see a true blog search product soon, or something else we can only guess at.

So which one is it? I can only provide guesses. Given the utter lack of evidence and, more importantly, rumors, I'm leaning towards the infrastructure explanation. However, a good infrastructure is necessary for a major strategic shift or product launch. Time will tell.

Live.com Spambot Ignores robots.txt

Oh, MSNbot, when will you ever learn? I won't rehash the story that led me to block MSN's referral-spamming bot, a block that seems to have worked a bit. The problem is that the referral spam is still coming in! Yes, MSNbot is blocked but the spammy hits are still coming in.

Case in point, this hit from today over at Social Alerter:

/tips/how-not-get-dugg
  • At: 19 April 2008 11:04:39 AM GMT
  • Referred from: http://search.live.com/results.aspx?q=alerts&mrt=en-us&FORM=LIVSOP
  • Remote: livebot-65-55-165-107.search.live.com (65.55.165.107)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Encoding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

Is it just me or is this beyond comical now?

The Real Strategy Behind Google App Engine

I just had an "OMG this will change the world!" kind of moment while playing for just 5 minutes with Google's App Engine. Let me explain.

A bit of background first: The Google App Engine is a newly-launched service from Google that, for a change, seems to be well thought out. The service offers a Python-only environment (for now) to build applications locally and host them on Google's vast infrastructure. The idea here is that you don't have to worry about scaling your application to handle massive traffic; you let the App Engine running on Google's servers deal with it. The Engine comes with lots of goodies like handling database stuff, user logins (and what a boon that will be for Google accounts), and others. All in all, a nice comfy environment for rapid application development and reliable hosting.

But from all the buzz on the net, I think there is something missing that I just hinted at above:

to build applications locally and host them on Google's vast infrastructure

App Engine comes with its own development setup that runs off your computer (available for Windows, OSX, and Linux). You develop the application on your computer, run it, test it, add features, and then upload it to Google's computers. My question is this: What's stopping Google from turning the local development code into a full desktop-based runtime for web applications? Why keep it as a development-only environment?

Let's look at this from another angle: the desktop-webapp integration market. Adobe recently released their oddly-named AIR (Adobe Integrated Runtime). In the AIR-world, you can write applications in HTML/CSS/JS or Actionscript and package them into desktop applications that run within AIR or within the Flash player in the browser. The AIR environment is available for Windows and Macs, and Linux support is on the way. Brilliant move: one code base, both browser and desktop functionality.

Microsoft also has a similar play in the form of .Net, and more specifically Silverlight. The .Net runtime is available for many devices and platforms (mobile, desktop, and I think even the XBox). With Silverlight, Microsoft's play is to give developers a platform to use .Net in the browser; this is coming in Silverlight 2.0 this summer. So with this, again, one code base can be used on the web and on the desktop to give true multi-platform programming.

There are other entries in this market, Mozilla Prism being a prominent example. They all promise the same thing: one code, many places to run it with varying details.

Now back to App Engine and to the question I posed: imagine Google comes out with a desktop runtime/environment that turns App Engine webapps into desktop-based apps. This will be directly parallel to Adobe's AIR but with a big difference: the same code will also be easily deployable on a reliable and scalable infrastructure - Adobe doesn't have that.

There is another difference: because of the way App Engine works, you could easily imagine it talking to Google Apps like Google Docs etc. A desktop App Engine will bring Google's applications onto the desktop and open up a market-disrupting war: direct office productivity competition with Microsoft. To rephrase, App Engine could be Google's way to enter Microsoft's turf on the desktop.

So any evidence for this? Nothing solid, so it's all speculation, but I'll point to three hints:

  • The name. It's not App Server or App Service but App Engine. Google understands branding well enough (it's arguably the main source of their traffic) so their choice of words here is intriguing. And I can't help but think that Google's App Engine will drive some sort of Google Gears. Nudge, nudge, wink, wink.
  • When creating an application, you can specify that only users of a certain Google Apps domain can use the app. This integration with Google Apps is perhaps hinting at bigger things to come.
  • The APIs available in App Engine: already App Engine supports dealing with mail, and given the point above, you can imagine an API for the other Google Apps. This would enable a go for the desktop market.

What do you think? I think this is the best move out of Google yet and as disruptive as AdWords was.

MS Live Still Referral Spamming

That's right folks, after the initial fuss, the backtracking (with its very own official statement!), Microsoft's Live search engine is still doing these referral spamming requests.

I'm seeing this on my new service Social Alerter. The request details:

  • Remote: livebot-65-55-165-77.search.live.com (65.55.165.77)
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Referring URL: http://search.live.com/results.aspx?q=social&mrt=en-us&FORM=LIVSOP

Full list of IP addresses doing this:

  1. 65.55.165.90
  2. 65.55.165.43
  3. 65.55.165.96
  4. 65.55.165.120
  5. 65.55.165.100
  6. 65.55.165.76
  7. 65.55.165.16

The fake search queries are all either [social] or [alerts].
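If you want to filter these hits out of your stats, a minimal sketch in Python follows. The function name is hypothetical, and it assumes the pattern I'm seeing (a *.search.live.com host plus a search.live.com referer carrying FORM=LIVSOP) holds for your logs too.

```python
from urllib.parse import parse_qs, urlsplit


def is_live_referral_spam(remote_host: str, referer: str) -> bool:
    """True for hits from *.search.live.com that carry a FORM=LIVSOP search referer."""
    from_livebot = remote_host.lower().endswith(".search.live.com")
    parts = urlsplit(referer)
    query = parse_qs(parts.query)
    fake_search = (
        parts.netloc.lower() == "search.live.com"
        and query.get("FORM") == ["LIVSOP"]
    )
    return from_livebot and fake_search


print(is_live_referral_spam(
    "livebot-65-55-165-77.search.live.com",
    "http://search.live.com/results.aspx?q=social&mrt=en-us&FORM=LIVSOP",
))
```

A real visitor clicking through from Live search has the same kind of referer but does not come from a livebot host, which is why both conditions are required.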

Anyone else seeing this? It's clearly not fixed as they claimed and is starting to get annoying.

GTalk Translator Bot is Mediocre but Useful

By now you must have heard that Google Talk now includes translation bots you can invite into a conversation. When you invite any of these bots, they translate whatever you type from your language into the target language. A very brilliant idea with a perfect implementation mechanism, but does it work? Let's find out.

I've mentioned before that I am an Arabic speaker. Given that Arabic sports one of the most convoluted grammars on Earth, I thought what better way to test the bots than by having a solo chat with the en2ar bot. That is, I write in English and watch its Arabic responses. The results are below:

Google Talk translation bot conversation translating English to Arabic

Arabic speakers among you will spot many mistakes but the ideas are still mostly translated well. With basic phrases, the translation is flawless in most cases. With more convoluted writing, the translation breaks down. You can see two comments about a bad translation. The first one said "This translation sucks", which in colloquial English means it's bad. The translation used the meaning of "suck" literally, i.e., something you'd do with a straw and some juice. The next phrase, "This is a bad translation", was translated, well, badly, but the idea was still conveyed. The translation in Arabic actually says "This is the bad of translation". This grammatical structure is used in Arabic to emphasize the pinnacle of something (i.e. exemplary in its class), so in this case, the Arabic actually means "This is the worst of translation".

So all in all a useful feature but I don't see it being used for anything important like a business chat: the mistakes are simply too frequent for this to be used to convey complex ideas. It is machine translation after all and the state of the art is still bad.

Query String Collapsing

One of the problems that search engines and analytics packages have is dealing with URLs with query strings. For example, the following two URLs will return the same content from any given content management system, but they are two different URLs in the eyes of search engines and analytics packages:

http://example.com/page.php?id=1&title=hello&from=homepage

http://example.com/page.php?title=hello&from=homepage&id=1

So how can we figure out that they are actually the same URL really? The solution I came up with is a simple multi-step processing algo. It goes like this:

  • Take the query string variables and save them in an array. So in the case of our first URL, the array would contain the following key=>value pairs:

    $vars = array('id'=>'1', 'title'=>'hello', 'from'=>'homepage');
  • Next, sort the array in alphabetical order based on the key names, like:

    $vars = array('from'=>'homepage', 'id'=>'1', 'title'=>'hello');
  • Now rebuild the URL based on the new order of the variables:

    http://example.com/page.php?from=homepage&id=1&title=hello
  • By now the trick should be clear: if you do that to all the URLs, you will always reach the same final re-composed URL as long as the variables are the same (i.e. the same names, and neither URL has extra or missing variables).
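The steps above can be sketched in a few lines. This version is in Python using the standard urllib.parse module rather than PHP, and the function name is my own choice for illustration:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def collapse_query_string(url: str) -> str:
    """Rebuild the URL with its query string variables sorted by name."""
    parts = urlsplit(url)
    # Step 1: pull the variables out as (name, value) pairs.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Step 2: sort by name (pairs with the same name are then ordered by value).
    pairs.sort()
    # Step 3: rebuild the URL with the variables in their new order.
    return urlunsplit(parts._replace(query=urlencode(pairs)))


a = collapse_query_string("http://example.com/page.php?id=1&title=hello&from=homepage")
b = collapse_query_string("http://example.com/page.php?title=hello&from=homepage&id=1")
print(a)       # http://example.com/page.php?from=homepage&id=1&title=hello
print(a == b)  # True
```

One design choice to be aware of: the values are re-encoded when the URL is rebuilt, so the collapsed URL is best treated as a comparison key and stored next to the original rather than in place of it.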

I call this Query String Collapsing. Why "collapsing" instead of normalization or decomposition? No real reason apart from thinking about this as collapsing a whole slew of URLs into a single representative entity. And I just like that name more that way :)

With this, what can we do with analytics? Save both the original URL as requested and the collapsed URL. This opens up a nice set of funky things you can do, but that's another post...

MS Admits to Referral Spamming as Cloaking Check

Hot off the press: after the fuss raised by a bunch of us a few weeks ago, Donna now reports that Live ponies up about the referrer spam. They've issued a statement in which they acknowledge:

  • A bug that caused issues with AdSense/Overture reporting.
  • Distorting site statistics with unfilterable bot traffic (except we know how to filter them!)
  • Polluting HTTP logs with inappropriate terms (true).

Microsoft also states that "Hopefully webmasters have also noticed these issues disappearing. If you are still experiencing any issues, please contact us before you block MSNBot, to see if we can address the issue."

Let me be the first to say a big thank you to Microsoft for making a very solid public response to the issue and answering our questions. This kind of transparency is exactly what fosters a good relationship between a search engine and webmasters.

And yes, Live.com team, I do default to your search engine for my searches. Works a treat (most of the time ;) ).

PHP Auto Prepend and its Uses

A thread at SEO Refugee started with asking for help about a funny URL, which we deciphered to be a probe to try to attack the website. I've seen this before and so I suggested that they block the whole IP C-block. Which turned into the question of: how do you block IP addresses?

The way I do it is using PHP and it uses a nice little trick that few seem to know about. This is how:

PHP has a feature that allows you to pre-pend a file at every PHP request. This prepend file is the equivalent of having it include()ed at the top of every single PHP script on your site. It is done through a directive that is set either in php.ini or .htaccess. The directive is called auto_prepend_file. For .htaccess, this is what I use:

php_value auto_prepend_file "/full/path/to/a/prepend-file.php"

Because it runs at every PHP request and it runs before the actual requested script, you can do some really neat things. So what do I do? I'm developing this system internally and at the moment it does three things:

  1. Authenticate SE bots
  2. Analytics (the data logger)
  3. Block IP addresses

The blocking works as follows: there is a special directory where I put empty files that dictate the blocking. The file names are of two formats, a.b.c.d or a.b.c, depending on whether I want to block a specific IP address (the former format) or a C-block (the latter). In the pre-pend file, there is a simple check: figure out the remote IP address, and check for the presence of either its file or its C-block file. So if the remote IP is 111.222.333.444, it checks for the presence of either /111.222.333 or /111.222.333.444. If either exists, a 403 Forbidden header is returned and the code exit()s, so no actual content gets displayed.
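To make the file-presence check concrete, here is a minimal sketch of the same logic in Python (my real prepend file is PHP and is not shown here). The function names and the block directory are hypothetical. The address is validated first so that a malformed value can never be used to build a path.

```python
import ipaddress
from pathlib import Path


def is_blocked(remote_ip: str, block_dir: Path) -> bool:
    """True if a marker file exists for the IP (a.b.c.d) or its C-block (a.b.c)."""
    try:
        ip = str(ipaddress.IPv4Address(remote_ip))
    except ValueError:
        # Not a valid IPv4 address: nothing to look up.
        return False
    c_block = ip.rsplit(".", 1)[0]
    return (block_dir / ip).exists() or (block_dir / c_block).exists()


def add_block(entry: str, block_dir: Path) -> None:
    """Create the empty marker file, e.g. '203.0.113.7' or '203.0.113'."""
    (block_dir / entry).touch()
```

In a request handler you would call is_blocked() with the remote address before doing anything else, and answer with a 403 and no body when it returns True. The nice property of the empty-file approach is that blocking or unblocking is a single file operation with no config reload.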

This raises the question: how do you add files to the directory? Using a web interface of course :) You can do it with a simple touch() or an fopen().

For completeness, there is a sister directive called auto_append_file which runs after each PHP script is called (with the exception that if the script exit()s, the append file doesn't run). I've never used it, but it can be useful for things like measuring how quickly scripts run on your server.

Arabic SEO

I've been thinking about using Arabic in URLs, a question asked by Rand Fishkin of SEOmoz over at Cre8. Rand's question was:

What if you are optimizing in the Arabic character language set and want to include "keywords" in your URL

As an Arabic speaker and user of Arabic websites, I feel I can help answer this one. The answer is applicable to other languages as it deals with technical issues faced by all non-English languages. Arabic is merely the language we draw specific examples from. So here goes...

Talking in (en)code

URLs are allowed only a certain set of characters for them to work: the English alphabet (both lowercase and uppercase), the numbers, dashes, dots, forward slashes, the question mark, and a few others. For historical reasons, these chosen few characters are based on American English as defined by the ASCII standard. All other characters, like English punctuation and non-English characters, have to be encoded.
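To see what that encoding looks like in practice, here is a small Python sketch using the standard urllib.parse module. The four-letter Arabic word is written as Unicode escapes so the example survives software that mangles Arabic text. Each letter becomes two UTF-8 bytes, and each byte becomes a %XX triplet:

```python
from urllib.parse import quote, unquote

# Four Arabic letters (dal, lam, ya, lam), written as escapes.
word = "\u062f\u0644\u064a\u0644"

# Percent-encode every byte of the UTF-8 form of the word.
encoded = quote(word, safe="")
print(encoded)  # %D8%AF%D9%84%D9%8A%D9%84

# Decoding reverses the process.
print(unquote(encoded) == word)  # True
```

Four readable letters turn into 24 characters of noise, which is the usability problem in a nutshell.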

The question needs to be answered for domain names too. Wikipedia has a nice summary of international domain names that allow non-ASCII characters in them. However, support for that is not universal yet, and as we'll see later, different browsers handle international domain names differently. For now, I would recommend steering clear of these for SEO purposes.

Usability trumps the day?

OK, so we know that non-ASCII characters have to be encoded, and so what about Rand's question about keywords in the URL? This raises a very interesting question: If you know the URLs are going to be encoded, doesn't usability dictate that you use non-encoded text? So given the choice between these two URLs:

  • site.com/?page=%D8%AF%D9%84%D9%8A%D9%84
  • site.com/?page=directory (Rand's question was about a directory as in DMOZ not as in folder)

which one would you choose? Is the presence of the word in the URL really that key for ranking?

I would argue that in this case, for usability's sake, I would go for something like:

  • site.com/directory
  • site.com/node/1

Then the actual page contents will be in Arabic or any other language. The anchor text is also key, so in-site optimization becomes super-critical, not to mention on-page techniques.

International domain names

The other thing to consider is how users input URLs. Do they type them? How important is type-in traffic for the site in question? Most likely, people will type the domain name in English. Speaking of which, try this site: search that points to an Arabic domain name (see PS below as to why I'm linking to Google to give the domain name) and watch the URL in different browsers. Safari keeps the URL as it is, but Firefox and IE 7 change it to http://xn--ugb6bax.com/. That last URL is certainly not memory-friendly.
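The xn-- form is the ASCII (Punycode) spelling of the same domain name, and the conversion works in both directions. A quick sketch with Python's built-in "idna" codec, which implements the older IDNA 2003 rules (an assumption that is fine for this plain Arabic label):

```python
# The ASCII form that Firefox and IE 7 show in the address bar.
ascii_name = b"xn--ugb6bax.com"

# Decode to the Unicode form a user would actually read or type.
unicode_name = ascii_name.decode("idna")

# Encoding the Unicode form gives back the ASCII form the browser sends.
round_trip = unicode_name.encode("idna")

print(round_trip == ascii_name)  # True
print(len(unicode_name.split(".")[0]))  # 4 letters in the Arabic label
```

So nothing is lost in translation between the two forms. The only question is which one the browser chooses to show the user.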

So to sum up: I would be careful how I use non-English letters in URLs.

What about CLIR?

Back in May, Google came out with the all-singing-all-dancing-Ask.com-copycat called Universal Search. Buried in the announcements is a little gem called Cross-Language Information Retrieval (also see this). From SEL's post:

Search queries will be entered in the native language, translated into English and run against Google's index. Any retrieved pages/sites will then be translated from English back into the native language.

I'm sure this will affect the kind of SEO we're talking about here, but I haven't done any tests to see how and how much. Anyone got data?

Anglicized Arabic

Another thing you will notice is anglicized Arabic (Arabic written in English): sometimes you'll see numbers in the middle of the words. This is because there is a colloquial transliteration system developed over the past decade or so (thanks to mobile phones and the internet!) to write Arabic-only letters in English. Example: Arabic has two H-like characters. There is one pronounced like the H in Henry and a deeper one, sounding almost like the H you would make with a scratchy throat. The Henry-like H is transliterated into H in English, and the second type of H is transliterated into a 7.

How the search engines handle (parse and index) such transliterated text is a very big question. A quick search for [7mar] (donkey in Arabic :), which is used as an insult along the lines of stupid or moron) shows that quite a few pages are indexed with that "word" in it. Interestingly, Google thinks these pages are English, and if you do an Arabic search specifically, you get another set of SERPs suggesting that unless explicitly told it's Arabic, Google at least will get confused.

There is more to this story, as part of another bigger story, but that's for another post. In the meantime, please post questions and comments below :)

PS - Why isn't there a word of Arabic in this post? It's because WordPress thinks it's the best piece of software in the world and keeps editing my Arabic into question marks. Using either IE or FF does not fix this problem.

Googlebot Requested Another JS File

Back in January, I blogged about how Googlebot requested CSS and JS files. Ever since the news broke (to much fanfare), this front of SEO has been awfully quiet.

Well, GBot is at it again. This time it requested just a JS file and followed a 302 redirect to get it. The JS is embedded into the page with this code:

<script type="text/javascript" src="filename.js"></script>

The request for filename.js 302 forwards to another location which returns a 404, which I log. Yes, it's a trap designed to see who requests JS files and who doesn't (as part of my on-going bot research documented the past few posts).

The hit details are:

  • Remote: crawl-66-249-70-178.googlebot.com (66.249.70.178)
  • Using HTTP/1.1 GET
  • UA: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

The second request (the one GETting the URL in the 302 forwarding) came 1 second after the first request with identical hit settings.

And that's about it from me. Anyone else seeing similar hits?

Yahoo! SERPs Embed API RSS Feed

Something I've never seen before. In Firefox, I noticed that the Yahoo! SERPs are now showing the feed icon. For example, the search for [seo blog] embeds an API-based RSS feed of the query. Interestingly, the app identification (a required field for all API calls, and is supposed to be developer specific) is "yahoosearchwebrss".

I tried this for a few searches, both simple searches, like [seo blog], and more advanced ones like [inurl:ekstreme.com]. It's showing up for the UK-locali(s)zed old SERPs interface and the newly redesigned one.

I'd like to say this is a very cool idea, one to be filed under "Why didn't I think of that?" Well done Yahoo! for another innovation in SERPs.

Programmers: What’s your font?

An odd question to ask, but seriously: which font do you use in your text editor? Anything exotic? Is it monospaced or variable width? Serifed or not?

I ask because... well, no reason really. Just curious about what unsung heroes we use. I use Verdana because it's easier to read, and I'm now experimenting with Segoe UI. Never got into the monospaced fonts, even for programming.

So try changing your font today. You might just like it!

How to Crawl the Web like Slurp, How to Block Such Scrapers, and Explaining Weird Slurp Requests

Yet another bot-related post. The subject this time is Yahoo!, specifically Yahoo!'s crawler Slurp and the Babelfish translation service. The topic is weak SE bot authentication, specifically, the ability to scrape content from sites that use a weak form of Slurp authentication. Also, while researching this, I came up with an explanation for weird hits coming from Yahoo's IP addresses.

All search engines introduced a mechanism to authenticate their bots. Yahoo! said back in June that all Slurp hits will come from *.crawl.yahoo.net, but my recent observations showed that's not the case. Still, it doesn't matter for scraping purposes as there is a way to mimic Slurp's behavior almost 100%, perhaps enough to get away with it.

To see it in action, using Firefox and the User Agent Switcher extension, set your UA to Slurp's, namely "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)". Next, go to Babelfish, type your website's URL, choose a language pair, and translate the page. The bottom frame is the one we're interested in, so right click on that and choose "This Frame->Show only this frame". Next, check your log files.

In eKstreme.com's case translated into French, the frame's URL is this:

http://74.6.146.244/babelfish/translate_url_content?.intl=us&lp=en_fr&trurl=http%3A%2F%2Fekstreme.com

And the interesting hit details:

Remote host: proxy3.search.scd.yahoo.net (66.94.237.142)

UA: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) (via babelfish.yahoo.com)

Notice how close this is to a real Slurp hit: it has the word 'Slurp' in the user agent and comes from yahoo.net. So any (weak) Slurp authentication checking simply for Slurp and yahoo.net will be fooled. However, there are two key differences to allow for proper authentication:

  • The hit is not from *.crawl.yahoo.net, but from *.scd.yahoo.net.
  • The user agent has "(via babelfish.yahoo.com)" appended.

So in short: if you really want to authenticate Slurp, really do check for *.crawl.yahoo.net in the remote host, not simply yahoo.net. Incidentally, that's the advice Yahoo! gives, so follow it!
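The difference between the weak and the proper check is easy to show in code. Below is a minimal Python sketch. The function names are hypothetical, and it assumes remote_host is a host name you have already confirmed with a reverse and forward DNS lookup.

```python
def weak_slurp_check(remote_host: str, user_agent: str) -> bool:
    """The fooled version: only looks for 'Slurp' and 'yahoo.net'."""
    return "slurp" in user_agent.lower() and remote_host.lower().endswith("yahoo.net")


def is_genuine_slurp(remote_host: str, user_agent: str) -> bool:
    """Stricter version: requires *.crawl.yahoo.net and no Babelfish marker."""
    host = remote_host.lower().rstrip(".")
    ua = user_agent.lower()
    claims_slurp = "yahoo! slurp" in ua
    from_crawl_hosts = host.endswith(".crawl.yahoo.net")
    via_babelfish = "(via babelfish.yahoo.com)" in ua
    return claims_slurp and from_crawl_hosts and not via_babelfish


# The Babelfish hit from my logs.
host = "proxy3.search.scd.yahoo.net"
ua = ("Mozilla/5.0 (compatible; Yahoo! Slurp; "
      "http://help.yahoo.com/help/us/ysearch/slurp) (via babelfish.yahoo.com)")

print(weak_slurp_check(host, ua))   # True: the weak check is fooled
print(is_genuine_slurp(host, ua))   # False: the strict check is not
```

The leading dot in ".crawl.yahoo.net" matters: without it, a host like evilcrawl.yahoo.net.example.com would be rejected anyway by endswith, but a lookalike such as notcrawl.yahoo.net would slip through.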

There is more to this story. When Babelfish translates a page, it requests the URL to be translated twice. The first one is an HTTP HEAD request, and all being well (like not getting a 404 error), the page is properly requested using HTTP GET, which is what I described above. If an error is encountered, the GET request is not sent.

The HEAD request is very interesting as it sets the UA to be identical to that of the browser requesting the translation, without adding "(via babelfish.yahoo.com)". So in the log files, the hit appears as an ordinary UA coming from *.yahoo.net. Which UAs can you see? Anything out there: I've seen bots like Sogu and AdSense Mediabot, as well as IE and Firefox. If you spoof your browser to be Slurp as described above, you'll get a Yahoo! Slurp request coming from *.scd.yahoo.net, not from *.crawl.yahoo.net, meaning that the Slurp authentication will fail.

This, I believe, explains a question someone posted a year ago at Webmaster World. Following on from a different thread, Yahoo_Mike explained what was going on but didn't explain the details of the double-requests.

So in summary, three things:

  • Double check how you authenticate Slurp and make sure you're doing it properly to avoid scrapers.
  • We now have an explanation of the weird behavior seen from *.yahoo.net proxies with details about exactly what's going on.
  • Why doesn't Babelfish identify itself with the HEAD requests? Come on Yahoo!, you can do better!

Google Web Accelerator: Please identify yourself

A while back, I noticed some fishy bot-like behavior coming from a Google-owned IP address. After asking around, a friend suggested it could be Google Accelerator. So I emailed Google support and, to cut a long story short, it indeed was Google Accelerator (GWA for short).

The IP address back then was 64.233.172.34, which Google confirmed to be a public-facing GWA IP address. The hits were very bot-like: no referer, requesting pages blocked by robots.txt, and identifying themselves using the default user agents for IE or Firefox. However, the hits also showed atypical bot signs: Looking through the log files, I noticed that after the page is requested, all associated files are also requested: the Javascript files, the image files, and the CSS files. Interesting in its own right because remember, the hits are coming from a Google IP address but are really requests from real users - the GWA was acting as a proxy. I hope the implications of this are clear.

Regardless, I dropped it - my question was answered. But now it's back...

Over the past 10 days or so, a new IP address started to show the same pattern. This time, the IP address is 66.249.85.133 and it certainly belongs to Google. It resolves to ff-in-f133.google.com and makes its requests over HTTP/1.1, asking for gzip'ed pages. The requests are still for pages blocked by robots.txt, identify themselves as IE 6.0 (default user agent), and come in without any referer. However, this time associated JS files are not requested, putting the new behavior firmly in botland.
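
The "does it fetch the associated files?" test is easy to run over a log. Here is a rough sketch, assuming the log has already been parsed into (ip, path) pairs; the function name fetches_assets and the extension list are my own illustrative choices, not anything Google documents.

```python
from collections import defaultdict

# Assumption: these extensions stand in for "associated files".
ASSET_EXTENSIONS = (".js", ".css", ".png", ".gif", ".jpg", ".jpeg", ".ico")


def fetches_assets(hits):
    """Map each IP to True if any of its hits requested a static asset.

    hits is an iterable of (ip, path) pairs from a pre-parsed log.
    """
    result = defaultdict(bool)
    for ip, path in hits:
        # Ignore the query string when looking at the file extension.
        is_asset = path.split("?", 1)[0].lower().endswith(ASSET_EXTENSIONS)
        result[ip] = result[ip] or is_asset
    return dict(result)
```

An IP that maps to True looks like a proxy for real browsers; one that maps to False after many page requests looks like a bot. It is only a heuristic: a caching proxy or a browser with a warm cache would also skip the assets.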

So far, I've noticed only a few hits, none of which identified themselves as Firefox. Given the history, my best bet at the moment is that it is GWA again on a new IP address, but the lack of JS requests makes me wonder if they also updated the code - maybe for analytics purposes? Regardless, GWA is still acting as a proxy, and so I expect it to identify itself as such. It can easily modify the user agent to hint that it's there. At the very least, it will be useful for analytics; examples of why identification is useful:

  • How many GWA requests does your site get?
  • Are GWA requests labelled as bots and discounted?
  • Should GWA requests be labelled as bots? This is more philosophical than technical.
  • Can GWA be used to scrape websites?

And of course, many more questions. So if anyone works for Google maybe you can spare a minute for this? :D

Good Bots Gone Bad

I've been keeping a very close watch on bots hitting eKstreme.com lately and I've come up with some interesting observations. Some of them are of great importance to webmasters (like MSNBot's entries) and others are more just FYI. In no particular order:

  • Feedburner's bot does not obey robots.txt, specifically this rule:
    User-agent: *
    Disallow: /socializer/?
    It sends HTTP/1.1 HEAD requests (not the usual GET) to the Socializer's bookmarking pages. The remote host is chi-fetch.feedburner.com (66.150.96.121) and the user agent is FeedBurner/1.0 (http://www.FeedBurner.com).
  • MyBlogLog still has an empty user agent string. I posted about this on Cre8 detailing how I emailed MBL support and they promised (within minutes) to forward it to an engineer. The remote host is www1.mbl.sp1.yahoo.com (69.147.90.63). If you browse to that, you get the MyBlogLog home page. Why should you care? Because if you block empty UAs as an anti-scraper method, MBL will get blocked too.
  • MSNBot's authentication doesn't work. Yes, MSN's bot authentication is BROKEN for these IP addresses: 65.54.165.43, 65.55.235.216, 65.54.165.65, and 65.55.233.40. A lot of crawling activity occurs from these addresses, but they resolve to *.phx.gbl (what is that, anyway, Microsoft?!) not to the expected *.search.live.com. Because of this, any crawls from these IP addresses do not authenticate and so are blocked here on eKstreme.com and blogSci.com.
  • Where is Yahoo! Slurp's bot authentication? It was promised way back at the end of March, but I'm still seeing about half of the Slurp requests from *.inktomisearch.com in addition to the promised *.crawl.yahoo.net that allows for authentication.
  • Tailrank's bot does not obey robots.txt. I emailed them a while back and they promised the next (then-imminent) update would fix that, but nope, not yet. Tailrank still rummages through eKstreme.com without regard to how it should behave.
  • Techmeme's bot does not identify itself. The remote host comes up as techmeme.com (75.126.195.146) and the user agent is Mozilla/5.0 (compatible; Wazzup1.0.7613; http://70.86.131.10/Wazzup).
  • There is a lot of crawling activity from *.amazonaws.com. A quick background: Amazon is developing a whole great set of platform services called Amazon Web Services (AWS). I recommend everyone read up on them, especially EC2 and S3. The AWS crawling activity is mostly web startups looking to index the web or feeds - so fine. What I'm seeing is more and more evidence for scrapers running off AWS, which is not good. How to deal with this is a tough one: block everything from *.amazonaws.com or should Amazon personally identify the account holder using the remote host (like accountname.amazonaws.com)? The latter will strongly discourage scrapers and help in authenticating bots. If the scrapers get more frequent, everything on AWS will be blocked, including the good ones. Amazon needs to act now before this becomes a problem that affects all their customers.
  • More on MSNBot: sometimes it doesn't obey the robots.txt file. Hits from 65.55.208.139 are ignoring this rule:
    User-agent: *
    Disallow: /dev
    That's only started in the last few days. Weird.
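
If you want to check your own logs for robots.txt violations like the ones above, Python's standard urllib.robotparser can do the matching. This is a minimal sketch: the violations helper and the (user agent, path) input format are assumptions of mine, and the rules are just the /dev example from the list.

```python
from urllib.robotparser import RobotFileParser

# The /dev rule quoted above, written out as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dev
"""


def violations(hits, robots_txt=ROBOTS_TXT):
    """Return the (user_agent, path) hits that the robots.txt rules forbid.

    hits is an iterable of (user_agent, path) pairs from a pre-parsed log.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(ua, path) for ua, path in hits if not parser.can_fetch(ua, path)]
```

Any hit it returns is a request the bot should not have made. Remember that robots.txt rules are prefix matches, so Disallow: /dev covers everything whose path starts with /dev.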

I'm sure there is more I've missed, but this will do for now. I'll leave you with a parting thought perhaps hinting at the future of search at MSN/Live: the HTTP_ACCEPT header MSNBot sends with all requests is this:

text/html, text/plain, text/xml, application/*, Model/vnd.dwf, drawing/x-dwf

This is very interesting because:

  • application/* is all applications and binary files, including ZIP files. We talked about how Google is indexing (badly) binary data and how that's showing up in the SERPs. What are MSN/Live's plans there I wonder? It could be simply to index PDF files so the question is, can MSNBot 'see inside' ZIP files?
  • Model/vnd.dwf and drawing/x-dwf: Very interesting. Model/vnd.dwf is Autodesk Design Web Format and so is drawing/x-dwf, which, as far as I can understand, are text-based representations of designs for web delivery. Will we start seeing AutoCAD designs in Live Image Search soon? As I like to say, this is "fertile ground for speculation" ;)
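
To see exactly what that header covers, here is a small sketch that splits an Accept header into its media ranges and tests a type against them. The helper names are mine, and it deliberately ignores q-values and other parameters, so treat it as an illustration rather than a full content-negotiation implementation.

```python
from fnmatch import fnmatchcase

# The Accept header MSNBot sends, as quoted above.
MSNBOT_ACCEPT = ("text/html, text/plain, text/xml, application/*, "
                 "Model/vnd.dwf, drawing/x-dwf")


def accepted_types(header):
    """Split an Accept header into lowercase media ranges, dropping parameters."""
    return [part.split(";", 1)[0].strip().lower()
            for part in header.split(",") if part.strip()]


def accepts(header, media_type):
    """Return True if media_type matches any media range in the header."""
    wanted = media_type.lower()
    return any(fnmatchcase(wanted, media_range)
               for media_range in accepted_types(header))
```

For example, accepts(MSNBOT_ACCEPT, "application/zip") and accepts(MSNBOT_ACCEPT, "application/pdf") are both true because of the application/* wildcard, while accepts(MSNBOT_ACCEPT, "image/png") is false: the header names no image types at all.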

A Beginner’s Guide to PHP Processing Forms

Joe Dolson has a great article about Processing Forms with PHP, a Beginner’s Guide. Excellent read.

Thanks, Joe!
