Houston, We Have a Twitter

That’s right folks. Today I decided to actually do something about my Twitter account. Follow me at pierrefar.

The question is *what* will I do with the account? It may be a few days before I dive in properly :) See you @twitter

How to *REALLY* Deal with Hackers

Donna over at SEO Scoop asks an excellent question: more and more, we’re seeing websites attacked for SEO purposes rather than with more malicious intent (like stealing credit card details). Donna asks, how should we deal with this kind of attack? I’m going to hazard some suggestions.

First things first. We’re not dealing with hackers. Nosiree, we’re dealing with crackers. A hacker is a well-seasoned coder. A cracker is a hacker who exploits security holes for nefarious purposes.

With semantics out of the way, here are some suggestions:

  • Googlebomb yourself: If you get attacked with, for example, the Slash One WordPress exploit, essentially you’re going to get a lot of spammy "content" pages and lots of links to them. So what happens if you use .htaccess or otherwise to redirect all requests for wp-content/1/* to, say, your site’s home page? Or why not to your newly minted, specially created, [Texas holdem play online] site? Hey, you’re probably going to get a lot of traffic, so use it! Here is the code:
    # Redirect every request under wp-content/1 to the new site.
    # [L] stops later rules (e.g. WordPress's own) from rewriting the result.
    RewriteEngine On
    RewriteRule wp-content/1(.*)$ http://my-new-spammy-aff-site.com [R,L]

    Essentially, you’ll googlebomb yourself with their links and use their traffic.

  • Use robots.txt as a defensive tool: A search engine doesn’t need to see wp-content anyway, so block it:
    User-agent: *
    Disallow: /wp-content

  • It’s the keywords, stupid: someone just dumped a load of keyword-laden pages on your site, with targeted keyword links pointing back to them. Hello? Anyone care to turn this into a keyword research tool? Here is the pseudocode for the tool:
    Do a Google search for [inurl:wp-content/1]
    Scrape the URLs from the SERPs
    Scrape the spammy URLs
    For each spammy URL, do a [link:] search
    Scrape the backlinks and extract the anchor texts
    Save the keywords along with the spammy HTML
    Write a front-end to search the database
  • Report them! Figure out the IP address of the person who uploaded the spammy pages and report them. If you get trackback spam to the spammy pages, find the IP address of the trackback spammers and report them. Most SEO spammers will be using hosting services and their own computers. It is possible (although I’m guessing unlikely) they’ll be using a proper botnet.
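
For the keyword research idea above, the "extract the anchor texts" step is the only part that needs real parsing. Here is a minimal Python sketch of that one step, assuming you have already fetched the HTML of a page that links to the spam. The names AnchorTextCollector and extract_anchor_texts are mine, and the sample domain in the usage line is made up:

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the text of every <a> whose href contains url_marker."""

    def __init__(self, url_marker):
        super().__init__()
        self.url_marker = url_marker
        self.anchor_texts = []
        self._current = None  # text fragments of the link being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.url_marker in href:
                self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._current).split())
            if text:
                self.anchor_texts.append(text)
            self._current = None


def extract_anchor_texts(html, url_marker="wp-content/1"):
    """Return the anchor texts of links pointing at the spammy pages."""
    collector = AnchorTextCollector(url_marker)
    collector.feed(html)
    collector.close()
    return collector.anchor_texts


sample = '<a href="http://victim.example/wp-content/1/p.html">texas <b>holdem</b> online</a>'
print(extract_anchor_texts(sample))
```

The searching and scraping steps are left out on purpose: they depend on whatever the search engines allow at the time.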

So, like pretty much everything in SEO, perhaps even this can be dealt with using some creativity… I’m sure there are better ways to deal with such spam, and the idea is to think about the opportunities here. Good luck!

MS Live Still Referral Spamming

That’s right folks, after the initial fuss, the backtracking (with its very own official statement!), Microsoft’s Live search engine is still doing these referral spamming requests.

I’m seeing this on my new service Social Alerter. The request details:

  • Remote: livebot-65-55-165-77.search.live.com (65.55.165.77)
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Referring URL: http://search.live.com/results.aspx?q=social&mrt=en-us&FORM=LIVSOP

Full list of IP addresses doing this:

  1. 65.55.165.90
  2. 65.55.165.43
  3. 65.55.165.96
  4. 65.55.165.120
  5. 65.55.165.100
  6. 65.55.165.76
  7. 65.55.165.16

The fake search queries are all either [social] or [alerts].

Anyone else seeing this? It’s clearly not fixed as they claimed and is starting to get annoying.

Announcing Social Alerter

Doesn’t it suck when you discover your site is down because a page went popular on Digg? Wouldn’t it be nice if you somehow knew that your site is slowly inching its way up the upcoming list? And what about delicious? That could be a serious hit of traffic too.

Well now you can get a warning. Over the past few months, I’ve been slowly building a service called Social Alerter. Social Alerter is a free service that alerts you when your websites are about to go popular on Digg and delicious. You can monitor as many sites as you want and once it finds one, it sends you an email. You can use it to monitor your own sites, your competitors’ sites (ha ;) ), and your favorite sites. You simply sign up and know that there is an eye out doing all the leg work.

This is the service in a nutshell. I’ve written a huge help section and if you read just one page, read the Social Alerter crash course.

GTalk Translator Bot is Mediocre but Useful

By now you must have heard that Google Talk now includes translation bots you can invite into a conversation. When you invite any of these bots, they translate whatever you type from your language into the target language. A brilliant idea with a perfect implementation mechanism, but does it work? Let’s find out.

I’ve mentioned before that I am an Arabic speaker. Given that Arabic sports one of the most convoluted grammars on Earth, I thought what better way to test the bots than by having a solo chat with the en2ar bot. That is, I write in English and watch its Arabic responses. The results are below:

[Image: Google Talk translation bot conversation translating English to Arabic]

Arabic speakers among you will spot many mistakes but the ideas are still mostly translated well. With basic phrases, the translation is flawless in most cases. With more convoluted writing, the translation breaks down. You can see two comments relating to a bad translation. The first one said "This translation sucks", which colloquially in English means it’s bad. The translation used the meaning of "suck" literally, i.e., something you’d do to a straw and some juice. The next phrase, "This is a bad translation", was translated, well, badly, but the idea was still conveyed. The translation in Arabic actually says "This is the bad of translation". This grammatical structure is used in Arabic to emphasize the pinnacle of something (i.e. exemplary in its class), so in this case, the Arabic actually means "This is the worst of translation".

So all in all a useful feature but I don’t see it being used for anything important like a business chat: the mistakes are simply too frequent for this to be used to convey complex ideas. It is machine translation after all and the state of the art is still bad.

Query String Collapsing

One of the problems that search engines and analytics packages have is dealing with URLs with query strings. For example, the following two URLs will return the same content from any given content management system but they are two different URLs in the eyes of search engines and analytics packages:

http://example.com/page.php?id=1&title=hello&from=homepage

http://example.com/page.php?title=hello&from=homepage&id=1

So how can we figure out that they are actually the same URL really? The solution I came up with is a simple multi-step processing algo. It goes like this:

  • Take the query string variables and save them in an array. So in the case of our first URL, the array would contain the following key=>value pairs:

    $vars = array('id'=>'1', 'title'=>'hello', 'from'=>'homepage');
  • Next, sort the array in alphabetical order based on the key names, like:

    $vars = array('from'=>'homepage', 'id'=>'1', 'title'=>'hello');
  • Now rebuild the URL based on the new order of the variables:

    http://example.com/page.php?from=homepage&id=1&title=hello
  • By now the trick should be clear: if you do that to all the URLs, you will always reach the same final re-composed URL as long as the variables are the same (i.e. the same names and values, and neither URL has extra or missing variables).

I call this Query String Collapsing. Why "collapsing" instead of normalization or decomposition? No real reason apart from thinking about this as collapsing a whole slew of URLs into a single representative entity. And I just like that name more that way :)
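
The steps above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not the code I run: the function name collapse_query_string is made up for this example, and it sorts by key and then by value so that repeated keys also come out in a stable order.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def collapse_query_string(url):
    """Return the URL with its query string variables sorted by name."""
    parts = urlsplit(url)
    # Step 1: take the variables apart (keeping empty values such as "a=").
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Step 2: sort them alphabetically by key (ties broken by value).
    pairs.sort()
    # Step 3: rebuild the URL with the variables in the new order.
    return urlunsplit(parts._replace(query=urlencode(pairs)))


a = collapse_query_string("http://example.com/page.php?id=1&title=hello&from=homepage")
b = collapse_query_string("http://example.com/page.php?title=hello&from=homepage&id=1")
print(a == b)
```

Note that parse_qsl and urlencode also normalize the percent-encoding of the values, which is a side effect beyond the reordering described above.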

With this, what can we do with analytics? Save both the original URL as requested and the collapsed URL. This opens up a nice set of funky things you can do, but that’s another post…

Irony

Support Wikipedia!

Hint: Look at the source code…


MS Admits to Referral Spamming as a Cloaking Check

Hot off the press: after the fuss raised by a bunch of us a few weeks ago, Donna now reports that Live has owned up to the referrer spam. They’ve issued a statement in which they acknowledge:

  • A bug that caused issues with AdSense/Overture reporting.
  • Distorting site statistics with unfilterable bot traffic (except we know how to filter them!)
  • Polluting HTTP logs with inappropriate terms (true).

Microsoft also states that "Hopefully webmasters have also noticed these issues disappearing. If you are still experiencing any issues, please contact us before you block MSNBot, to see if we can address the issue."

Let me be the first to say a big thank you to Microsoft for making a very solid public response to the issue and answering our questions. This kind of transparency is exactly what fosters a good relationship between a search engine and webmasters.

And yes, Live.com team, I do default to your search engine for my searches. Works a treat (most of the time ;) ).

Yell if Microsoft’s Live.com Spammed You Too – Updated

The bot analysis continues, and this post presents evidence indicating that Microsoft is spamming websites. A big claim, I know, but I can’t find a better explanation. You’ll have to decide.

The summary: IP addresses belonging to Microsoft are requesting pages from eKstreme.com and blogSci.com (my science blog) with HTTP referer headers suggesting that the hits were from live.com searches. These referer headers are spoofed, as the keywords from these supposed searches are sometimes in no way related to the requested page. Additionally, for most of the other supposed searches, the requested pages do not rank in the top 10 (first page of results), so those searches could not have sent this traffic.

For some odd reason, the webmaster community has known about this for a couple of months. In September, SE Roundtable posted about other webmasters complaining about this spam. Surprisingly, we also got official confirmation (via a WMW thread) from msndude that this is indeed happening and it’s (and I’m quoting) "part of a quality check we run on selected pages". This is an unacceptable explanation, as you’ll see from the data below, because it has none of the hallmarks of a quality check but all the marks of referral spam.

The hits discussed below are extracted from the blogSci.com data to keep things simple, but a similar data set exists for eKstreme.com.

The Hits

The whole list of hits is way too long to quote in full here, so here is a sampling of my favorite requests:

  • At: 17 August 2007 05:53:27 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/result.aspx?q=make+money+online&mrt=en-us&FORM=LVSP
  • Remote: bl2sch1082213.phx.gbl [] (65.55.165.119)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Encoding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

  • At: 18 August 2007 03:05:43 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/result.aspx?q=make+money+online&mrt=en-us&FORM=LVSP
  • Remote: bl2sch1082008.phx.gbl [] (65.55.165.66)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Encoding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

These two hits above are the first I have in my records. What’s amusing about them is that both supposedly came from a search for [make money online].

  • At: 19 August 2007 03:55:48 AM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/result.aspx?q=ticket&mrt=en-us&FORM=LVSP
  • Remote: bl2sch1081815.phx.gbl [] (65.55.165.25)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Encoding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

This one is also very random: a blog post about a cool new magnet-based technology to create colors is ranking in the top 10 for the query [ticket]? Not even Live.com generates such irrelevant results.

Anything more recent? Sure:

  • At: 11 November 2007 03:26:43 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/results.aspx?q=osteoporosis&mrt=en-us&FORM=LIVSOP
  • Remote: bl2sch1081815.phx.gbl [] (65.55.165.25)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Encoding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

  • At: 11 November 2007 03:29:24 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/results.aspx?q=amazon&mrt=en-us&FORM=LIVSOP
  • Remote: bl2sch1081909.phx.gbl [] (65.55.165.43)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Encoding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

At the time of writing, there are 245 such hits in my records since August 2007.

Not convinced? There is more. Some of these hits came within seconds of the page being indexed by MSNBot. The pattern is like this: the page is requested by MSNBot (which is authenticated, so it’s genuine) and within a few seconds, the very same page is requested as described above with a live.com search as the referer. An example:

  • At: 10 November 2007 12:05:14 PM GMT
  • Routed to: /index.php
  • Referred from: (No referer.)
  • Remote: livebot-65-55-209-143.search.live.com [] (65.55.209.143)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: text/html, text/plain, text/xml, application/*, Model/vnd.dwf, drawing/x-dwf
    • Charset:
    • Encoding: identity;q=1.0
    • Languages:
  • UA: msnbot/1.0 (+http://search.msn.com/msnbot.htm)
  • Cookies:

  • At: 10 November 2007 12:05:36 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/results.aspx?q=problem&mrt=en-us&FORM=LIVSOP
  • Remote: bl2sch1081810.phx.gbl [] (65.55.165.20)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Encoding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

The typical delay between the indexing request and the spoofed search hit request is 5-20 seconds.

How to Recognize the Fake Hits

Anyone staring at these hits long enough will see some signatures to detect them:

  • Note how all of them have identical user agents (UA field) and pretty much everything else is identical (bar the requested page and the referer).
  • The IP addresses all belong to the same C-block, namely 65.55.165.*.
  • All of the query strings in the live.com referrers have &mrt=en-us in them. Here in the UK, I get &mkt=en-gb when I really use Live.com for a search.

Needless to say, this smells like bot behavior.
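
These three signatures are easy to turn into a filter. Below is a minimal Python sketch that flags a hit only when all three match. The function name and the decision to require all three at once are my own choices for this example, and the values are taken straight from the hits quoted above, so they will go stale if Microsoft changes the UA or the IP range:

```python
import ipaddress
from urllib.parse import urlsplit, parse_qs

# Signature values copied from the logged hits; treat them as a snapshot.
SPOOF_NETWORK = ipaddress.ip_network("65.55.165.0/24")
SPOOF_UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"


def looks_like_fake_live_hit(remote_ip, user_agent, referer):
    """True if a hit matches all three fake-referral signatures."""
    ref = urlsplit(referer or "")
    if ref.hostname != "search.live.com":
        return False  # no live.com search referer, nothing to check
    params = parse_qs(ref.query)
    return (
        ipaddress.ip_address(remote_ip) in SPOOF_NETWORK
        and user_agent == SPOOF_UA
        and params.get("mrt") == ["en-us"]
    )


print(looks_like_fake_live_hit(
    "65.55.165.25",
    SPOOF_UA,
    "http://search.live.com/results.aspx?q=osteoporosis&mrt=en-us&FORM=LIVSOP",
))
```

Hits flagged this way can be dropped from the stats without blocking MSNBot itself.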

An Analysis

Let’s think about this for a minute: What on Earth is going on? Why are these hits happening? I can think of two explanations:

  • The tinfoil/sinister explanation: pure spam from MS. Why? So that webmasters see Live.com referrals coming in increasing numbers. This is not hard to hide: if you only get like 10 referrals from live.com a month, another 10 is a doubling but which sad webmaster would check those out (apart from me)?
  • The "surely not" explanation: this is an automated way to check the search results to see where pages rank for keywords the page could potentially rank for. This is what msndude confirmed in the WMW thread, but as you can see above, it doesn’t really look like a quality check. Also, if this is indeed a quality check, why not run it on the cached pages and not alert (and annoy) the webmasters? Microsoft have full access to their index and they should use it!

I subscribe firmly to the first explanation: the search keywords are spammy in some cases and always too general; the requested pages never rank in the top 10 as the referring URLs would suggest; the hits have identical user agents (i.e. not the variation you would expect from different people using normal browsers on different operating systems, even within the same company); and the referring URL does not match what a human being searching on live.com generates.

In short: it’s spam and not a quality control check. What do you think?


Update 1: DazzlinDonna from SEO Scoop has written an excellent background to this fiasco, and Michael VanDeMar is reporting that Microsoft is interfering with AdSense. Ouch.

Update 2: Yuri explains more background and asks What happens next?. Reuben Yau and Kichus have both blocked the IP addresses. Boy are people angry.

Update 3: This story has evolved since I wrote it. Some follow up posts:

PHP Auto Prepend and its Uses

A thread at SEO Refugee started with asking for help about a funny URL, which we deciphered to be a probe to try to attack the website. I’ve seen this before and so I suggested that they block the whole IP C-block. Which turned into the question of: how do you block IP addresses?

The way I do it is using PHP and it uses a nice little trick that few seem to know about. This is how:

PHP has a feature that allows you to pre-pend a file at every PHP request. This prepend file is the equivalent of having it include()ed at the top of every single PHP script on your site. It is done through a directive that is set either in php.ini or .htaccess. The directive is called auto_prepend_file. For .htaccess, this is what I use:

php_value auto_prepend_file "/full/path/to/a/prepend-file.php"

Because it runs at every PHP request and it runs before the actual requested script, you can do some really neat things. So what do I do? I’m developing this system internally and at the moment it does three things:

  1. Authenticate SE bots
  2. Analytics (the data logger)
  3. Block IP addresses

The blocking works as follows: there is a special directory where I put empty files that dictate the blocking. The file names take one of two formats, a.b.c.d or a.b.c, depending on whether I want to block a specific IP address (the former) or a whole C-block (the latter). In the prepend file, there is a simple check: figure out the remote IP address, and check for the presence of either its own file or its C-block file. So if the remote IP is 111.222.333.444, it checks for the presence of either /111.222.333 or /111.222.333.444. If either exists, a 403 Forbidden header is returned and the code exit()s, so no actual content gets displayed.

This raises the question: how do you add files to the directory? Using a web interface of course :) You can do it with a simple touch() or an fopen().
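
My real check lives in the PHP prepend file, but the lookup logic is language-neutral. Here is a minimal Python sketch of just that logic, assuming IPv4 addresses and a directory of empty marker files as described above. The function names is_blocked and block are made up for this example, and sending the 403 is left to whatever serves the request:

```python
import ipaddress
from pathlib import Path


def is_blocked(remote_ip, block_dir):
    """True if a marker file exists for the IP itself or for its C-block."""
    # Validate first so a malformed value can never be used as a file name.
    ipaddress.IPv4Address(remote_ip)
    c_block = ".".join(remote_ip.split(".")[:3])  # a.b.c.d -> a.b.c
    block_dir = Path(block_dir)
    return (block_dir / remote_ip).exists() or (block_dir / c_block).exists()


def block(entry, block_dir):
    """Create the empty marker file for a.b.c.d or a.b.c (like touch())."""
    Path(block_dir, entry).touch()
```

Checking for a file’s existence is cheap, and there is no configuration to parse or reload when a new block is added.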

For completeness, there is a sister directive called auto_append_file which runs after each PHP script is called (with the exception that if the script exit()s, the append file doesn’t run). I’ve never used it, but it can be useful for things like measuring how quickly scripts run on your server.

Googlebot Requested Another JS File

Back in January, I blogged about how Googlebot requested CSS and JS files. Ever since the news broke (to much fanfare), this front of SEO has been awfully quiet.

Well, GBot is at it again. This time it requested just a JS file and followed a 302 redirect to get it. The JS is embedded into the page with this code:

<script type="text/javascript" src="filename.js"></script>

The request for filename.js 302 forwards to another location which returns a 404, which I log. Yes, it’s a trap designed to see who requests JS files and who doesn’t (as part of my on-going bot research documented in the past few posts).

The hit details are:

  • Remote: crawl-66-249-70-178.googlebot.com (66.249.70.178)
  • Using HTTP/1.1 GET
  • UA: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

The second request (the one GETting the URL in the 302 forwarding) came 1 second after the first request with identical hit settings.

And that’s about it from me. Anyone else seeing similar hits?

Yahoo! SERPs Embed API RSS Feed

Something I’ve never seen before. In Firefox, I noticed that the Yahoo! SERPs are now showing the feed icon. For example, the search for [seo blog] embeds an API-based RSS feed of the query. Interestingly, the app identification (a required field for all API calls, and is supposed to be developer specific) is "yahoosearchwebrss".

I tried this for a few searches, both simple searches, like [seo blog], and more advanced ones like [inurl:ekstreme.com]. It’s showing up for the UK-locali(s)zed old SERPs interface and the newly redesigned one.

I’d like to say this is a very cool idea, one to be filed under "Why didn’t I think of that?" Well done Yahoo! for another innovation in SERPs.


Programmers: What’s your font?

An odd question to ask, but seriously: which font do you use in your text editor? Anything exotic? Is it monospaced or variable width? Serifed or not?

I ask because… well, no reason really. Just curious about what unsung heroes we use. I use Verdana because it’s easier to read, and I’m now experimenting with Segoe UI. Never got into the monospaced fonts, even for programming.

So try changing your font today. You might just like it!

How to Crawl the Web like Slurp, How to Block Such Scrapers, and Explaining Weird Slurp Requests

Yet another bot-related post. The subject this time is Yahoo!, specifically Yahoo!’s crawler Slurp and the Babelfish translation service. The topic is weak SE bot authentication, specifically, the ability to scrape content from sites that use a weak form of Slurp authentication. Also, while researching this, I came up with an explanation for weird hits coming from Yahoo’s IP addresses.

All search engines introduced a mechanism to authenticate their bots. Yahoo! said back in June that all Slurp hits will come from *.crawl.yahoo.net, but my recent observations showed that’s not the case. Still, it doesn’t matter for scraping purposes as there is a way to mimic Slurp’s behavior almost 100%, perhaps enough to get away with it.

To see it in action, using Firefox and the User Agent Switcher extension, set your UA to Slurp’s, namely "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)". Next, go to Babelfish, type your website’s URL, choose a language pair, and translate the page. The bottom frame is the one we’re interested in, so right click on that and choose "This Frame->Show only this frame". Next, check your log files.

In eKstreme.com’s case translated into French, the frame’s URL is this:

http://74.6.146.244/babelfish/translate_url_content?.intl=us&lp=en_fr&trurl=http%3A%2F%2Fekstreme.com

And the interesting hit details:

Remote host: proxy3.search.scd.yahoo.net (66.94.237.142)

UA: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) (via babelfish.yahoo.com)

Notice how close this is to a real Slurp hit: it has the word ‘Slurp’ in the user agent and comes from yahoo.net. So any (weak) Slurp authentication checking simply for Slurp and yahoo.net will be fooled. However, there are two key differences to allow for proper authentication:

  • The hit is not from *.crawl.yahoo.net, but from *.scd.yahoo.net.
  • The user agent has "(via babelfish.yahoo.com)" appended.

So in short: if you really want to authenticate Slurp, really do check for *.crawl.yahoo.net in the remote host, not simply yahoo.net. Incidentally, that’s the advice Yahoo! gives, so follow it!
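
To make the difference between the weak and the proper check concrete, here is a minimal Python sketch. The function name is_genuine_slurp is made up, and the forward lookup that confirms the host name resolves back to the same IP is my addition on top of the host name check described above. The two lookups are passed in as parameters only so the logic can be exercised without touching DNS:

```python
import socket


def _reverse_dns(ip):
    # IP -> primary host name
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host):
    # host name -> list of IPv4 addresses
    return socket.gethostbyname_ex(host)[2]


def is_genuine_slurp(ip, reverse_lookup=_reverse_dns, forward_lookup=_forward_dns):
    """True only if the IP reverse-resolves under crawl.yahoo.net and back."""
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        # The weak check ("yahoo.net" anywhere) would also pass *.scd.yahoo.net.
        if not host.endswith(".crawl.yahoo.net"):
            return False
        # Forward-confirm so a faked reverse DNS record is not enough.
        return ip in forward_lookup(host)
    except OSError:
        return False  # lookup failed: treat as not authenticated
```

With this check, the Babelfish hit from proxy3.search.scd.yahoo.net shown above fails, which is the desired result.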

There is more to this story. When Babelfish translates a page, it requests the URL to be translated twice. The first one is a HTTP HEAD request, and all being well (like not getting a 404 error), the page is properly requested using HTTP GET, which is what I described above. If an error is encountered, the GET request is not sent.

The HEAD request is very interesting as it sets the UA to be identical to the browser requesting the translation, without adding "(via babelfish.yahoo.com)". So what the hit will appear as in log files is a UA coming from *.yahoo.net. Which UAs can you see? Anything out there: I’ve seen bots like Sogu, AdSense Mediabot, IE, and Firefox. If you spoof your browser to be Slurp as described above, you’ll get a Yahoo! Slurp request coming from *.scd.yahoo.net not from *.crawl.yahoo.net, meaning that the Slurp authentication will fail.

This, I believe, explains a question someone posted a year ago at Webmaster World. Following on from a different thread, Yahoo_Mike explained what was going on but didn’t explain the details of the double-requests.

So in summary, three things:

  • Double check how you authenticate Slurp and make sure you’re doing it properly to avoid scrapers.
  • We now have an explanation of the weird behavior seen from *.yahoo.net proxies with details about exactly what’s going on.
  • Why doesn’t Babelfish identify itself with the HEAD requests? Come on Yahoo!, you can do better!

Google Web Accelerator: Please identify yourself

A while back, I noticed some fishy bot-like behavior coming from a Google-owned IP address. After asking around, a friend suggested it could be Google Accelerator. So I emailed Google support and, to cut a long story short, it indeed was Google Accelerator (GWA for short).

The IP address back then was 64.233.172.34, which Google confirmed to be a public-facing GWA IP address. The hits were very bot-like: no referer, requesting pages blocked by robots.txt, and identifying themselves using the default user agents for IE or Firefox. However, the hits also showed atypical bot signs: Looking through the log files, I noticed that after the page is requested, all associated files are also requested: the Javascript files, the image files, and the CSS files. Interesting in its own right because remember, the hits are coming from a Google IP address but are really requests from real users – the GWA was acting as a proxy. I hope the implications of this are clear.

Regardless, I dropped it – my question was answered. But now it’s back…

Over the past 10 days or so, a new IP address started to show the same pattern. This time, the IP address is 66.249.85.133 and it certainly belongs to Google. It resolves to ff-in-f133.google.com, requests using HTTP/1.1, and asks for gzip’ed pages. The requested pages are still ones blocked by robots.txt, the hits identify themselves as IE 6.0 (default user agent), and they come in without any referer. However, this time associated JS files are not requested, putting the new behavior firmly in botland.

So far, I’ve noticed only a few hits, none of which identified themselves as Firefox. Given the history, my best bet at the moment is that it is GWA again on a new IP address, but the lack of JS requests makes me wonder if they also updated the code – maybe for analytics purposes? Regardless, GWA is still acting as a proxy, and so I expect it to identify itself as such. It can easily modify the user agent to hint that it’s there. At the very least, it will be useful for analytics; examples of why identification is useful:

  • How many GWA requests does your site get?
  • Are GWA requests labelled as bots and discounted?
  • Should GWA requests be labelled as bots? This is more philosophical than technical.
  • Can GWA be used to scrape websites?

And of course, many more questions. So if anyone works for Google maybe you can spare a minute for this? :D

Firefox Extension Spying on Us? – Updated

Update: no more database logging. Details at bottom of this post…

The world of SEO went all smiling a few days ago with 97th Floor publishing their Social Media for Firefox extension. I think it’s a great idea; Chris thinks so too, and SEOMoz are terribly excited by it. But it’s spying on us. Let me explain.

Open up a new web page, say http://ekstreme.com/, make sure the SM for Firefox extension is in manual mode, and open the Live HTTP Headers extension. Now click the Manual button in the SM for Firefox extension and watch the headers scroll by.

You should see a few blocks of text: one for Digg, one for delicious, one for Stumble Upon, and one for reddit. The last request though is a request to the 97th Floor website. In eKstreme.com’s case, the URL is:

http://www.97thfloor.com/social-media-for-firefox/put.php?url=http%3A%2F%2Fekstreme.com%2F&service3=3&service1=2&service4=0&service5=0

and the full headers are:

———————————————————-

http://www.97thfloor.com/social-media-for-firefox/put.php?url=http%3A%2F%2Fekstreme.com%2F&service3=3&service1=2&service4=0&service5=0

GET /social-media-for-firefox/put.php?url=http%3A%2F%2Fekstreme.com%2F&service3=3&service1=2&service4=0&service5=0 HTTP/1.1

Host: www.97thfloor.com

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6

Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

Accept-Language: en-gb,en;q=0.5

Accept-Encoding: gzip,deflate

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

Keep-Alive: 300

Connection: keep-alive

Cookie: MintUnique=1; MintUniqueMonth=1188626400; MintUniqueWeek=1189317600

HTTP/1.x 200 OK

Date: Fri, 14 Sep 2007 23:10:41 GMT

Server: Apache/1.3.37 (Unix) mod_fastcgi/2.4.2 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.28 OpenSSL/0.9.7a PHP-CGI/0.1b

X-Powered-By: PHP/5.1.6

Keep-Alive: timeout=15, max=100

Connection: Keep-Alive

Transfer-Encoding: chunked

Content-Type: text/html

———————————————————-

Notice anything fishy? A filename called put.php (put where? A database?) on the 97th Floor website telling it the URL I just requested info for along with some service data. Surely you’re not spying on our social media activities 97th Floor… are you?
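
To see exactly what that request carries, the query string can be decoded. A minimal Python sketch follows. The function name reported_fields is mine, and what the service3, service1, etc. numbers mean is not something I can tell from the URL alone:

```python
from urllib.parse import urlsplit, parse_qs


def reported_fields(phone_home_url):
    """Decode the query string of the put.php request into a plain dict."""
    query = urlsplit(phone_home_url).query
    # parse_qs returns a list per key; each key appears once here.
    return {name: values[0] for name, values in parse_qs(query).items()}


fields = reported_fields(
    "http://www.97thfloor.com/social-media-for-firefox/put.php"
    "?url=http%3A%2F%2Fekstreme.com%2F&service3=3&service1=2&service4=0&service5=0"
)
for name, value in fields.items():
    print(name, "=", value)
```

The url field decodes to the full address of the page I was looking at, which is the part that matters here.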

You’ll notice from the headers that the put.php file returns text/html. What is the HTML? Browsing to the URL returns a blank page with one word: "Done". Done what, my dear? Logged the data into the database have we?

And are you tracking the hits with Mint too? Very slick.

So with all due respect, the extension is now uninstalled until we get a clear explanation from 97th Floor. Come on, the, errr, floor is all yours.


Update

After blogging the details above, I emailed a few people as a sanity check and to raise the alert. One of the people I emailed got in touch with Chris Bennett of 97th Floor, and so Chris emailed me and commented below. The summary of our discussions:

  • Yes there was data logging, but it was error logging. The data being sent via the URL is consistent with this, and I see no other evidence to shed more light on the question.
  • Chris emailed me a link to the database dump/report. It contained URLs and numbers associated with each of reddit, Digg, delicious, and SU for each URL. The download was huge – I stopped it at ~7MB.
  • Most of the URLs I saw in the database are harmless: news sites, blogs, etc.
  • Some of the URLs were bad to have in there: I didn’t know this, but Google apps apparently has some URLs with usernames attached. There are other web apps like that. It’s generally a bad idea to tie a username to a login URL (i.e., giving a cracker half the info they need…), but the system still won’t log you in automatically.
  • Some URLs are really dangerous to have. Some login systems have a step in the login process that creates a unique URL associated with that session. Anyone who knows this (very hard to guess) URL, is logged in automatically, without a password being asked again. Yes, there was at least one URL on such a system in the database.
  • The bad URLs were logged when people left the extension in automatic mode.
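To make the risk in the list above concrete, here is a minimal sketch of how a URL could be reduced before it is logged anywhere. This is my own illustration, not what the extension does: the function name safe_to_log and the rules it applies (skip anything that is not plain http, skip URLs carrying credentials, drop the query string and fragment) are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def safe_to_log(url):
    """Return a reduced form of url that is safer to log, or None to skip it.

    Hypothetical rules, chosen for illustration only.
    """
    parts = urlsplit(url)
    # Skip https and non-web schemes: these are more likely to be logged-in pages.
    if parts.scheme != "http":
        return None
    # Skip URLs that embed credentials (user:password@host).
    if parts.username or parts.password:
        return None
    # Keep scheme, host and path; drop the query string and the fragment,
    # which is where session tokens often live.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(safe_to_log("http://example.com/news/story?sid=abc123#top"))
```

Note that this does not catch usernames or session IDs embedded in the path itself, so it lowers the risk rather than removing it.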

So what’s the conclusion? Given what I know (all summed up above), how Chris reacted, and what other people I know and trust said about Chris, my opinion is that this is an innocent mistake that had serious consequences. There is no evidence of malice that I know of, and regardless, it’s now fixed.

Less than 24 hours after I blogged the post, Chris released an updated extension and an apology for the whole thing. I installed the new extension and saw no ‘phone-home’ activity in four different test URLs.

Chris should be commended for his quick and decisive response. I for one am happy to move on. But for everyone out there, the usual ‘keep your eyes open’ warning always applies. Next time it won’t be someone who fixes the problem.

Good Bots Gone Bad

I’ve been keeping a very close watch on bots hitting eKstreme.com lately and I’ve come up with some interesting observations. Some of them are of great importance to webmasters (like MSNBot’s entries) and others are more just FYI. In no particular order:

  • Feedburner’s bot does not obey robots.txt, specifically, this command:
    User-agent: *
    Disallow: /socializer/?

    It sends HTTP 1.1 HEAD (not the usual GET) requests to the Socializer’s bookmarking pages. The remote host is chi-fetch.feedburner.com (66.150.96.121) and the user agent is FeedBurner/1.0 (http://www.FeedBurner.com).

  • MyBlogLog still has an empty user agent string. I posted about this on Cre8 detailing how I emailed MBL support and they promised (within minutes) to forward it to an engineer. The remote host is www1.mbl.sp1.yahoo.com (69.147.90.63). If you browse to that, you get the MyBlogLog home page. Why should you care? Because if you block empty UAs as an anti-scraper method, MBL will get blocked too.
  • MSNBot’s authentication doesn’t work. Yes, MSN’s bot authentication is BROKEN for these IP addresses: 65.54.165.43, 65.55.235.216, 65.54.165.65, and 65.55.233.40. A lot of crawling activity occurs from these addresses, but they resolve to *.phx.gbl (what is that, anyway, Microsoft?!) not to the expected *.search.live.com. Because of this, any crawls from these IP addresses do not authenticate and so are blocked here on eKstreme.com and blogSci.com.
  • Where is Yahoo! Slurp’s bot authentication? It was promised way back at the end of March, but I’m still seeing about half of the Slurp requests from *.inktomisearch.com in addition to the promised *.crawl.yahoo.net that allows for authentication.
  • Tailrank’s bot does not obey robots.txt. I emailed them a while back and they promised the next (then-imminent) update would fix that, but nope, not yet. Tailrank still rummages through eKstreme.com without regard to how it should behave.
  • Techmeme’s bot does not identify itself. The remote host comes up as techmeme.com (75.126.195.146) and the user agent is Mozilla/5.0 (compatible; Wazzup1.0.7613; http://70.86.131.10/Wazzup).
  • There is a lot of crawling activity from *.amazonaws.com. A quick background: Amazon is developing a whole great set of platform services called Amazon Web Services (AWS). I recommend everyone read up on them, especially EC2 and S3. The AWS crawling activity is mostly web startups looking to index the web or feeds – so fine. What I’m seeing is more and more evidence for scrapers running off AWS, which is not good. How to deal with this is a tough one: block everything from *.amazonaws.com or should Amazon personally identify the account holder using the remote host (like accountname.amazonaws.com)? The latter will strongly discourage scrapers and help in authenticating bots. If the scrapers get more frequent, everything on AWS will be blocked, including the good ones. Amazon needs to act now before this becomes a problem that affects all their customers.
  • More on MSNBot: sometimes it doesn’t obey the robots.txt file. Hits from 65.55.208.139 are ignoring this command:
    User-agent: *
    Disallow: /dev

    That’s only started in the last few days. Weird.
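Two of the checks behind the list above can be sketched in a few lines of Python. This is a rough illustration, not the code running on eKstreme.com. It assumes that bot authentication means a forward-confirmed reverse DNS lookup (IP to hostname, then hostname back to the same IP) and that the two Disallow rules quoted above sit in the same robots.txt. The function names and the www.ekstreme.com URLs in the usage lines are made up for the example.

```python
import socket
from urllib.robotparser import RobotFileParser

# Assumption: the two Disallow rules quoted above live in one robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /socializer/?
Disallow: /dev
"""

# Hostname suffixes named above for crawlers that support authentication.
TRUSTED_SUFFIXES = (".search.live.com", ".crawl.yahoo.net")


def reverse_lookup(ip):
    """Return the reverse DNS hostname for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the IPv4 addresses hostname resolves to (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def authenticate_bot(ip, suffixes=TRUSTED_SUFFIXES,
                     reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS: the IP must resolve to a trusted
    hostname, and that hostname must resolve back to the same IP."""
    hostname = reverse(ip)
    if not hostname or not hostname.lower().endswith(suffixes):
        return False  # e.g. something ending in .phx.gbl fails here
    return ip in forward(hostname)


def violates_robots(url, user_agent="*", robots_txt=ROBOTS_TXT):
    """True if robots_txt disallows url for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical URLs, for illustration only.
print(violates_robots("http://www.ekstreme.com/dev/page"))
print(violates_robots("http://www.ekstreme.com/blog/"))
```

The lookup functions are passed in as parameters so the logic can be tried out with canned answers instead of live DNS queries. A hit that fails authenticate_bot but claims to be MSNBot or Slurp is the kind that gets blocked.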

I’m sure there is more I’ve missed, but this will do for now. I’ll leave you with a parting thought perhaps hinting at the future of search at MSN/Live: the HTTP_ACCEPT header MSNBot sends with all requests is this:

text/html, text/plain, text/xml, application/*, Model/vnd.dwf, drawing/x-dwf

This is very interesting because:

  • application/* is all applications and binary files, including ZIP files. We talked about how Google is indexing (badly) binary data and how that’s showing up in the SERPs. What are MSN/Live’s plans there I wonder? It could be simply to index PDF files so the question is, can MSNBot ‘see inside’ ZIP files?
  • Model/vnd.dwf and drawing/x-dwf: Very interesting. Model/vnd.dwf is Autodesk Design Web Format and so is drawing/x-dwf, which, as far as I can understand, are text-based representations of designs for web delivery. Will we start seeing AutoCAD designs in Live Image Search soon? As I like to say, this is "fertile ground for speculation" ;)
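To see what that header actually covers, here is a small sketch that matches a content type against the media ranges in an Accept header. It is a simplification on my part: the function names are hypothetical, and it ignores q-values and other parameters, which a full implementation would have to honor.

```python
MSNBOT_ACCEPT = ("text/html, text/plain, text/xml, application/*, "
                 "Model/vnd.dwf, drawing/x-dwf")


def parse_accept(header):
    """Split an Accept header into lowercase media ranges, dropping parameters."""
    ranges = []
    for part in header.split(","):
        media_range = part.split(";", 1)[0].strip().lower()
        if media_range:
            ranges.append(media_range)
    return ranges


def accepts(header, content_type):
    """True if content_type matches any media range in header.

    Simplified: q-values (including q=0) are ignored.
    """
    wanted_type, _, wanted_sub = content_type.lower().partition("/")
    for media_range in parse_accept(header):
        range_type, _, range_sub = media_range.partition("/")
        if range_type in ("*", wanted_type) and range_sub in ("*", wanted_sub):
            return True
    return False


for content_type in ("application/zip", "application/pdf", "image/png"):
    print(content_type, accepts(MSNBOT_ACCEPT, content_type))
```

Under this reading, application/* matches both ZIP and PDF files, while images such as image/png are not listed at all.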

Socializer Update

A quick update to the Socializer:

  • Mister Wong has been added to the services, in the Top Services section. Why? They’re very large in Europe with multi-language support and a dedicated user base.
  • Netvouz was moved into the Top Services section. I did a quick check of the rankings (like this) and it certainly deserves it. Heck, it out-ranked Slashdot! I’ve kept Slashdot in the Top Services section because of their new Firehose service. I’m going to be keeping an eye on it.
  • The web-marketer’s Digg, Sphinn, has been added to the list. It’s certainly a niche service, so it’s not grouped as a Top Service. However, I’m going to be keeping a very close eye on it and PlugIM. Depending on the rankings, a switch might be in order. We’ll see.

We now return to our regularly scheduled silence :) Yes, I’m very sorry for the lack of posts but I’m working on a new service that will launch shortly. It’s been taking up quite a bit of time. More soon – like this weekend soon!

eKstreme.com Reloaded

As you may have noticed, eKstreme.com now sports a brand new design. Not only that, it’s also been ported to a new backend, namely WordPress. On top of that, it’s on a new host now that promises to be more reliable and a whole lot faster.

Yep, a hat-trick of news that’s been 3 months in the making. If you ever wondered why I haven’t been replying to emails much, or why I haven’t been blogging much lately, this is why. A lot of effort went into this, and I’m sooooo relieved it’s over… almost.

Things will be broken! I know that. Firstly, the tools are complex beasts to run completely in WP. Yet they do, for the most part. Two tools are broken at the moment: the spell checker and the Backlink Social Celebrity tool. The first one is waiting on some software to be installed on the server, so it might take a few days. The latter is just sub-par coding on my part and I decided that instead of trying to fix it, I’d re-write the thing. Rest assured I’m on the case and I’ll have them up ASAP.

Having said this, please file bug reports and feedback using the contact form.

Speaking of which, there are two people I’d like to thank wholeheartedly:

  • Joe Dolson who patiently gave constant feedback on the design as it was evolving (slowly) and helped troubleshoot some DNS issues.
  • Mike Cherim because eKstreme.com now uses the WP port of his contact form. It was very easy to add it in and it works a treat.

So, thanks to all the readers and users of eKstreme.com. This redesign was done with you in mind. As always, feedback welcome, and there is a LOT more to come now!

Free Lunch 404

Time Magazine published a nice article by Bill Tancer (of Hitwise fame) talking about the top 10000 keywords searched for that contain the word ‘free’ like [free games] or [free myspace layouts]. The analysis is very interesting for anyone into keyword research but amusingly, he found that no one is searching for [free lunch].

It’s not long, and well worth a read.

[tags]hitwise, free, keyword research[/tags]
