I’m Joining Google

Tomorrow morning I will start working at Google’s London office as a Webmaster Trends Analyst. SEOs will immediately know what that role is, as I will be John Mueller’s teammate. For those who don’t know, it’s the team within Google that interacts with website owners and manages Webmaster Central, among other things.

What does this mean for my websites? A few things:

  • I will not be building any new tools. All my tools that already exist here and elsewhere are going to remain as they are and will not be updated.
  • Very important corollary: The tools here on eKstreme.com and elsewhere, especially the SEO tools, were built before I joined Google and thus are not in any way official Google tools and must not be taken as endorsed by Google. So please no one try this one. OK?
  • I’ve been running some SEO experiments here on eKstreme.com and elsewhere. Although I’ve stopped them, their effects might still be hiding in a cache somewhere or whatnot. If you see anything fishy/funny it is NOT an official Google recommendation of how to do things. Heck, as an SEO I’ve pushed the boundaries a bit and so if you try anything you think I’ve tried or am trying, it’s at your own risk.
  • My OpenCourseWare search engine (OCW Search) is now officially part of the OpenCourseWare Consortium. As I will not be able to work on it from now on, I’ve decided to donate it to the Consortium instead of shutting it down. Its dream lives on and it is now in much more capable hands than mine, hands that even have time to work on it!

Some Yes questions: So will I start blogging more? Hopefully. Will I be at conferences? Yes, though I don’t know when. I’ll post here and hopefully get to meet more of the people I’ve been talking to over the years.

Some No questions: Will I tell you SEO secrets if you ask me? Of course not! Can you hire me? No! I’m done with the freelance work. Can we exchange links? Nope.

A little disclaimer: This website is still my own and anything written on it is my own personal view/opinion and not my employer’s. This is not an official Google blog.

Finally, you should follow me on twitter.

Awesome new free SEO tool: Blekko

As of today, 1 November 2010, a new search engine is open to the public: blekko. Regardless of its ambitions (to be better than Google and topple them), it has a very useful treasure for internet marketers: a thorough SEO analysis of any URL or domain they have indexed.

Before I go through some of the data they share, note that they have a toolbar that allows you to "view SEO data in real time". And because they allow you to mark pages as spam to kill them completely from your search results, the toolbar has a button to do that too. Interesting way to crowdsource spam control.

Back to SEO data: To see it, you simply search for the domain name you’re interested in followed by a /domain. For example, for ekstreme.com: ekstreme.com /domain. That first page is a gold mine in its own right, as it tells you how many inbound links you have and from how many domains. And for each domain it lists, you can dig in a bit deeper to understand that domain. This is actually the first of many tabs shown. Clicking through them you’ll see they display a few graphs (sadly, the ever useless pie charts), give crawl stats data, and list duplicate content.

The crawl stats data hints at how fresh/stale their data is, and from testing a few domains I own and some of my SEO clients’ domains, it’s clear the freshness is a big issue: it’s very hit and miss, and for newer and/or smaller sites, they are lagging significantly. For example, when analyzing OCW Search, the numbers indicate they have crawled 8 pages (there are more ;) ) and the average page length is 0kb, which is clearly wrong. They also tell you the reverse IP address and for OCW Search it’s Amazon, which is the previous host that I moved away from a couple of months ago. So take this data with a grain of salt!

The last report I want to highlight is the comparison. By default when you click on the compare tab, you will be comparing the www domain with the non-www domain. Again, for ekstreme.com: ekstreme.com /compare. This tells you who is linking to each domain which is a good rough estimate of the problem size if you are dealing with canonical URL issues. BUT, the real kicker here is that you can compare different domains, for example ekstreme.com and www.ocwsearch.com (screenshot below). This is a gold mine for market analysis.

blekko search domain comparison

So all in all an excellent free tool, but one that suffers from stale data – the problem for all SEO tools. My recommendation is like any other SEO tool: use it while understanding the data’s limitations.

Finally, I’d like to note that this is a genius marketing strategy on blekko’s part: getting SEOs to talk about them and use them is a great way to get early traction. Us search geeks are the very very leading edge of early adopters, so well done on spotting this opportunity.

Remember me?

Hello? tap, tap Is this thing on?

The last blog post was from way back in January. Things have moved on since then but I just haven’t had time to blog about them. So here goes:

  • I’m now a freelance web developer and SEO. Want to hire me? Look at my funnily-named SEO company website.
  • I built, launched, and got sucked into the resulting whirlwind of OCW Search, a free online courses search engine. OCW Search helps you find free downloadable courses from universities like MIT, Notre Dame, The Open University in the UK, and many more. Yes, I’m an SEO and I operate a search engine now. Let me tell you it’s an awesome experience.
  • I’m expanding OCW Search into a full-featured online education startup. Sign up to the mailing list to get access before everyone else :) You know you want to!
  • In June 2010, I gave a talk at MongoUK about the technology behind OCW Search.

And if you’re reading this, you’ll notice that eKstreme.com now looks different. Well I thought that the previous design was getting old and I was moving the site to a new server (and a new PHP framework I built), and thought this was a good chance to give it a fresh look.

Speaking of which, I retired a LOT of the tools and code previously available on eKstreme.com. Why? I just don’t have time to provide support and to keep updating them as APIs change and bugs are identified. I have a different focus now and it’s not fair to users or to me to keep semi-functional code released without support.

Speaking at SES London 2010

I’m very happy to confirm this: I will be speaking at the Automating Twitter session at Search Engine Strategies 2010 in London on 18 February 2010.

SES 2010 Logo

My talk will be about analytics for social media marketing. Whenever you launch a marketing campaign, you need to measure it in detail to understand its performance. I will talk about some of the important actionable metrics you need to track, with code examples of how to track them. Other key topics covered are filtering, automation, and reporting, all of which feed into experimentation to find the most effective marketing messages.

Google Alerts Now Spell Checks the Queries

Lately I’ve been noticing a lot of weird hits coming in via my Google Alerts emails. I’ve dug into it and I think I’ve figured out what’s going on: Google Alerts is spell checking the queries and matching the queries as it would do in a search. This is in addition to matching the Alert queries exactly as previously. This new behavior kicked in about a week or 10 days ago.

For example: I keep an alert for [blogsci] because I have a website at blogsci.com. Up till recently, I used to get alerts only when the word "blogsci" was matched in a page. Now, I’m getting Alerts for pages that never mention the word "blogsci" but do match the spell-checked "blog sci". So I get matches for "…blog: sci-fi…". See what happened there?

Another example: I run a website with a domain name of XY.com where X is a word and Y is another word. My Alert is set to match it exactly as [XY]. This was going well until recently when I started getting alerts that match [X Y].

Another example: I have an alert for [cli.gs], my latest web app. I get a lot of spurious alerts for this because it matches [cli gs] which is a very popular combination apparently.

Anyone else seeing this weirdness? Any other interpretations? Thoughts in the comments please!

Hey YouTube: UK = GB, and both are English

Sometimes I see help messages that just leave me speechless. This message from YouTube about my automatically-set language preferences goes above and beyond anything I’ve seen in a long time because it has two big "WTF moments":

The problems?

  • The red circles: The suggestion that English (UK) is different from English (GB). Psst. They’re the same thing. UK is an exceptional reservation in the ISO 3166-1 standard, where the official code is GB.
  • The black circle: The whole message is apparently not in English because the link at the bottom right corner gives me the option to view it in English. When I click it, I get the same message, but instead of suggesting English (UK), it suggests just plain old English. And oh, it gives me the option to change my language to the real English of English (US).

Hey, I have news for you YouTube: English, English (UK), English (GB) and English (US) are all freakin’ English.

Yahoo! Search Doing a SERPs Usability Survey


I was just searching with Yahoo! and I saw a survey request from "Yahoo! Surveys". It was a big purple box to the immediate right of the results list, and it was anchored to the bottom of the screen (so even if I scrolled down, it went down too). I clicked on it before I realized I should have taken a screenshot, but I did take a screenshot of the single question in the survey. The question opens in a new window:

Photobucket

Click for full size, and no, I’m not going to tell you what my answer was :p

The SEOmoz Linkscape Ghost

If you’re part of the SEO industry, unless you’ve been living under a rock for the past couple of days, you will know that SEOmoz launched a new tool called Linkscape, to much fanfare. First things first, congrats and kudos are due to the SEOmoz team for building such a complex beast. It’s not easy at the very least on the technical level.

But there is a problem: SEOmoz has not disclosed the user agent (UA) of its crawler. Here I will talk about why this is a bad thing, and also take a stab and go out on a limb and say: there is no SEOmoz crawler, at least not in the traditional sense. For the latter, I will offer a viable technical alternative, which may or may not be correct, but the fact the alternative exists gives a sensible explanation as to why SEOmoz is not offering a straight answer to the UA question.

Why Disclosing the UA is Essential

Let’s not mince words: we as an SEO community like a little mud fight once in a while. We debate and discuss and yes fight. But one thing we all know how to recognize is malicious activity and differentiate it from aggressive activity.

Example: a bot scraping our content for an MFA site is a tolerated nuisance. We take steps to negate the effects of scrapers but at the end of the day we don’t fight them hard. On the other hand, a bot probing for security holes is treated like a witch in 1209AD.

Which is why Linkscape’s lack of disclosure hurts: We as a community work hard at identifying bots. SEOmoz is supposed to be a good citizen of the SEO world, and yet the lack of transparency goes against the spirit and the image of SEOmoz. On the one hand we have a company with a strong community doing good deeds (SEO trademark fight anyone?) and yet it behaves in a way we expect out of the shady side of the net we deal with every day.

Not just that: the data collected from us, about us, will be used against us. It’s called competitive intelligence.

And not just that: SEOmoz is using the data to make money. The free version is pathetic and the Pro version needs a monthly subscription.

To me, this kind of behavior (stealth, harmful, and to make money) puts Linkscape squarely in the naughty corner. I certainly didn’t expect this out of SEOmoz. Tough luck Rand and co: you have a great brand and I for one expect better!

But I won’t ask for a UA because I think there isn’t one.

How To Build Linkscape

It’s actually quite easy on a conceptual level. However, just like cooking, having a recipe doesn’t make you a great chef – there are lots of details that SEOmoz must have tackled successfully to build Linkscape. I am not trying to belittle their achievement, and all I can show you is one recipe. This recipe is completely my guess and could very well be wrong. I have not talked to anyone at SEOmoz.

So come on Pierre, what is it? The answer is the Yahoo! Search API. It’s an API giving programmers complete access to the Yahoo! index without crawling a single page. For example, the following URL:

http://search.yahooapis.com/WebSearchService/V1/webSearch?appid=YahooDemo&query=site%3Aseomoz.org%2F&results=2

fetches the first two hits from a Yahoo! [site:seomoz.org] search. Interestingly, it tells you where the cache URLs are, and they reside on Yahoo! servers (unsurprisingly). So you fetch the cache from Yahoo!, do the analysis, save what you care about (links, titles, etc), and you’re done.

You’ll need to kick start this somehow with a seed set of sites. DMOZ and Wikipedia are usually good sources that are freely available. Wikipedia can even be downloaded so no one needs to know. Yahoo!’s very own Delicious, Digg, reddit, etc are also good starting points because they tell you what’s hot right now. The seed is basically a huge set of URLs from which you extract the domain names and do [site:domain] queries. Lather, rinse, repeat.

Notice that you won’t need to crawl a single page yourself. You let Yahoo! do the work for you. Neat, no?
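To make the recipe a bit more concrete, here is a rough Python sketch of the "seed URLs in, [site:domain] queries out" step. The endpoint and the YahooDemo app ID are copied from the example URL above; the helper names domain_of and site_query_url and the two seed URLs are made up for illustration. This is only my guess at a recipe, not how SEOmoz does it.

```python
from urllib.parse import urlencode, urlparse

# Endpoint copied from the example URL above.
API_ENDPOINT = "http://search.yahooapis.com/WebSearchService/V1/webSearch"


def domain_of(url):
    """Extract the host name from a seed URL (hypothetical helper)."""
    return urlparse(url).hostname


def site_query_url(domain, results=2, appid="YahooDemo"):
    """Build the API URL for a [site:domain/] query (hypothetical helper)."""
    params = {"appid": appid, "query": f"site:{domain}/", "results": results}
    return f"{API_ENDPOINT}?{urlencode(params)}"


# Illustrative seed set; a real one would come from DMOZ, Wikipedia, etc.
seed_urls = ["http://www.seomoz.org/blog", "http://seomoz.org/tools"]
domains = sorted({domain_of(u) for u in seed_urls})
for d in domains:
    print(site_query_url(d))
```

Calling site_query_url("seomoz.org") reproduces the example URL above. Fetching the results and the cached pages is left out on purpose.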

So What Should SEOmoz Disclose?

Above I said two potentially conflicting things: SEOmoz should disclose the Linkscape user agent and then went on to show that it doesn’t need to have a user agent. So what exactly am I asking from SEOmoz?

Easy: complete disclosure. If SEOmoz is using a traditional crawler, we must have its UA and the IP addresses. It’s only a matter of time for us to find them. If not, SEOmoz needs to explain clearly why not.

Announcing Cligs: Short URLs with Analytics and SEO Friendliness

That’s right folks, the short URL market is broken and I’m fixing it. The new service is called Cligs (like Clicks but with a G). It’s a short URL service on steroids. The key feature is that it tracks the clicks of the short URLs.

What kind of analytics do you get? At launch right now:

  • Cligs gives you tons of traffic data and analytics about the traffic your short URLs get. This includes:
    • Number of hits
    • Referral stats
    • Mentions on twitter, blogs, and the web
    • Mentions of the destination URL on twitter, blogs, the web, and delicious

    And lots more! And if you want more data, just let me know!

  • Cligs forwards with a 301 Permanent Redirect so your destination URL gets full SEO benefits of the link. If you are an affiliate marketer, this means you can hide your backlinks, get traffic, get statistics, and get the SEO benefits.
  • With Cligs, you can create an unlimited number of short URLs for the same destination URL. This is great because you can promote the same destination at different sites like twitter or facebook by using different cligs and watch how each source sends you traffic.
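For the curious, the core idea (several short URLs pointing at one destination, each counted separately, each answering with a 301) can be sketched in a few lines of Python. This is an illustration only and not the actual Cligs code: the short IDs, the example.com destination, and the follow function are all invented for the sketch.

```python
from collections import Counter

# Hypothetical in-memory store: two short IDs for the same destination URL.
CLIGS = {
    "abc": "http://example.com/landing",  # the one promoted on twitter
    "xyz": "http://example.com/landing",  # the one promoted on facebook
}
hits = Counter()       # clicks per short ID
referrers = Counter()  # clicks per (short ID, referrer) pair


def follow(short_id, referrer=""):
    """Record the click, then answer with a 301 to the destination."""
    destination = CLIGS.get(short_id)
    if destination is None:
        return 404, {}
    hits[short_id] += 1
    referrers[(short_id, referrer or "(none)")] += 1
    # 301 Permanent Redirect with the destination in the Location header.
    return 301, {"Location": destination}


print(follow("abc", "http://twitter.com/"))
print(follow("xyz", "http://facebook.com/"))
print(hits)
```

Because each short ID keeps its own counters, you can compare how each source performs even though the destination is the same.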

That’s just the start. There are a ton of new features that are going to be added in the coming few days and weeks, including some SEO-useful analytics.

And, of course, there is a bookmarklet:

Shorten Link @ Cli.gs

So what are you waiting for? Stop using plain-vanilla short URL services and start using Cligs.

Comments and feedback most welcome.

New Stealth Crawler from Yahoo!

For the past few months, I’ve been tracking on my science blog a crawler from Yahoo! that does not identify itself. The bot’s details are:

Requested page: /science/converting-blood-groups
  • At: 06 May 2008 10:21:05 AM GMT
  • Routed to: /index.php
  • Referred from: http://blogsci.com/science/converting-blood-groups
  • Remote: crawl1.image.srch.kr1.yahoo.com (203.212.174.181)
  • Request: HTTP/1.1 GET
  • Accepting:
    • HTTP: */*
    • Charset:
    • Enconding:
    • Languages:
  • UA:
  • Cookies:

Notice a few interesting details: No user-agent string, the fact it provides an HTTP_REFERER header that’s the same page being requested, it comes from *.yahoo.com not the usual yahoo.net for Slurp, and the fact it says "image" and "srch" in the host.
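If you want to look for the same pattern in your own logs, a check along these lines might do as a starting point. It is a sketch that simply encodes the traits listed above; the function name and its arguments are mine, and how you pull these fields out of your logs depends on your setup.

```python
def looks_like_stealth_crawler(host, path, user_agent, referer,
                               site="http://blogsci.com"):
    """True when a request shows the traits described above (hypothetical helper)."""
    no_user_agent = not user_agent               # empty UA string
    self_referral = referer == site + path       # referer is the requested page itself
    yahoo_com_host = host.endswith(".yahoo.com")  # *.yahoo.com, not yahoo.net
    return no_user_agent and self_referral and yahoo_com_host


# The hit from the log excerpt above.
print(looks_like_stealth_crawler(
    host="crawl1.image.srch.kr1.yahoo.com",
    path="/science/converting-blood-groups",
    user_agent="",
    referer="http://blogsci.com/science/converting-blood-groups",
))
```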

The crawling is very low-level: a few hits a day, with lots of one-hit-a-day visits.

What’s really interesting is how laser-targeted it is: it’s only requested the same two pages many times since May. The pages are the specific blog post linked to above plus the archive page that contains that post, so it’s likely something about that post that’s of interest to the bot. And yes, the post contains an image, and the image is the only one in the main content of the archive.

I’ll dig deeper when I have a chance. Please let me know in the comments below if you’re seeing something similar.

The Ultimate jQuery Development Guide

This has got to be the best jQuery development guide I’ve seen. It’s one of those pages you bookmark or add to your scrapbook for those late night hacking sessions when things go wrong.

Opt Out of Behavioral Ad Targeting by Google/Doubleclick and Yahoo!

Oh yes, finally a way to tell the algo-borgs at Google/Doubleclick and Yahoo! that they should not track your behavior to deliver "more relevant" ads. You do that by visiting a page on each of their websites and clicking a button which sets a cookie that tells the system to not track your behavior.

Google also links to another page from the Network Advertising Initiative which lists quite a few ad systems you can opt out of.

The pages are:

While I’m at it, does anyone else find Yahoo!’s page to be much better than Google’s? Think about the usability: it tells you if you’ve opted in or out and explains that it’s per computer rather than per user (very important!!!). I’m just saying that as a landing page supposedly to help consumers, Google’s is a mess compared to Yahoo!’s clean and to the point page. The NAI’s is very good too.

Chatting with a Google Street View Driver




Note: some details in this post have been skipped or generalized to be a bit vague to protect the identity of the Google Streetview driver.

Google Street View Car

Sometime in the past few weeks, I was walking with a friend when we spotted a very funny looking car. We both immediately knew what it was and as the car drove closer by, our suspicions were confirmed: it was a Google Streetview car outside London. Feeling naughty, I shouted at the car as it drove by something along the lines of "there are privacy laws" and to my surprise an old man across the street did the same! It was very funny how both of us knew what a Streetview car looked like!

Then it hit me: the road we were on that the car was driving into was a dead end road. Picture time! So I dropped my stuff and asked my friend to watch them while I set up my phone and found a good spot to take some photos as the car drove back out again. So I watched as the car reached the end, did a U-turn and drove back out again. However, as it got close to me, the car pulled up into an empty parking spot and the driver came out. He shouted at me saying "I know you want to take pictures but I don’t want to be in them." I obliged.

While taking the photos, I talked to the driver a little bit. Here are some details from the notes I scribbled afterwards:

  • Google has a centre in Milton Keynes where this operation was based. The drivers just showed up for "a driving job" (his words) and didn’t know it was for Google until they arrived to pick up the cars.
  • The drivers were given training to use the computers inside the car. It’s not hard: it’s a large-ish touch screen (I guessed about 17in or maybe a 19in when I saw it) with a record and a pause button.
  • The screen is to the left of the driver in the passenger seat with a large server at the back in the trunk. The back seats of the car were removed – it was just a big space. The connections into the server were just power and ethernet. The ethernet seemed to be going up to the camera but I’m not sure if it ran to something else.
  • The camera is rain sensitive. It collapses in a very funky way and has to be covered. The drivers are under strict instructions to do so.
  • This particular driver was very sensitive to the privacy issues. He was having a personal conflict about the whole thing and was stopped by (his words) "10 people" that very day. Why? Because only recently had the BBC published an article about Google Streetview starting with "Google’s plans to launch a mapping tool in the UK could be referred to the Information Commissioner". No wonder the driver didn’t want to be in the photo!

Now some photos of the car with notes:

Google StreetView car, front view

The car from the front.

Google StreetView camera

The car’s camera. The octagon at the top is, I think, the camera set itself (so 8 cameras in total). The yellow box seems to be the communication/processing circuitry; the yellow box is on the back side of the car and so the white box thing at the right hand side of the image points towards the right of the car. This white box thing seems to swivel up and down but this is just a wild guess.

Google StreetView car, back view

The car’s camera kit as seen from the rear of the car. Just guessing what each bit is: Yellow box at the top, as above. White boxes to the left and right are the (potentially) swiveling bits – could they be cameras? The yellow disk at the bottom: a wireless communications dish? It could be a GPS receiver.

Update: Looking through some of the other images I had after someone dropped a hint on GTalk to me, the white boxes under the hexagon of cameras are laser range finders. Sure enough, I have a photo that has a warning that it’s a "Class 1 Laser".

Update 2: Thanks for all the comments. Yes I couldn’t count: there are 8 cameras not 6; that’s fixed now. Also, a lot of people wrote about the type of laser range finder and why you’d need it – see the comments below. Finally, lots of people noted a certain irony in the driver not wanting to be photographed. Point taken, but the guy was very conflicted about it. The BBC article was still in memory and clearly some people like me caused him some fuss on that day. He was talking a lot about wanting to quit this job. Deep down I think he did but of course I cannot know.

Update 3: Yes, some rain droplets are visible in a photo. It wasn’t raining while we were talking but it had rained earlier that day. When the driver parked, the camera hit some trees (you can see that in the photos) and the droplets are from the tree. It’s hard rain that gets the equipment as I understand it, and that’s when the drivers are supposed to cover up.

Twitter Bug: View Friend-Only Private Updates

On twitter, I’m following someone who I cannot un-follow due to a bug in Twitter. Why? Because said person changed their settings to only give updates to friends, so I see the message "I’m only giving updates to friends.". Visiting the person’s home page, I cannot see the Follow/Unfollow button because the interface only lets me ask the person to allow me to see his updates.

But I can easily see his updates.

Here is how: browse twitter using a mobile phone. Yes the mobile interface shows you these "private" updates but the web interface shows me the message "I’m only giving updates to friends.". I discovered this bug by accident while browsing using my mobile phone, but using a couple of extensions, you can pull off this trick in Firefox.

The screenshot below illustrates the bug. It’s basically the mobile version and the full normal version of twitter side by side. The lines map corresponding updates, with the yellow/orange one highlighting the bug.

Twitter bug showing private updates

Download full sizes of the screenshots used to make the image above:

I’ve filed a bug report with twitter.

What do you do with Unauthenticated Search Engine Bots?

Over at Search Engine Journal, Ann Smarty explains how to switch your UA to Googlebot and browse the web. The technique uses a Firefox extension to change the user agent string to that of Googlebot. Simple and works a treat. Except for…

The problem here is that it is very easy to authenticate Googlebot, Slurp, or MSNBot. The three major search engines give us a double-DNS trip to check whether a request pretending to be one of their crawlers is genuine or not. The authentication helps us webmasters fight against crawlers (not to mention other things ;) ). So the SEJ article is useful but it’s not 100% foolproof and people pretending to be GBot/Slurp/MSNBot will probably get trapped with snares laid by clever webmasters.
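For anyone who has not seen the double-DNS trip spelled out, here is a minimal Python sketch of it: reverse-resolve the requesting IP to a host name, check that the name belongs to the search engine, then forward-resolve that name and confirm it maps back to the same IP. The function names are mine, and the ".googlebot.com" suffix in the usage note is an assumption; check each engine’s own documentation for the host names it actually uses.

```python
import socket


def reverse_lookup(ip):
    """IP -> host name, via a reverse DNS (PTR) query."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host):
    """Host name -> list of IPv4 addresses, via a forward DNS query."""
    return socket.gethostbyname_ex(host)[2]


def is_genuine_bot(ip, allowed_suffixes,
                   reverse=reverse_lookup, forward=forward_lookup):
    """Double-DNS check: the IP must resolve to an allowed host name,
    and that host name must resolve back to the same IP."""
    try:
        host = reverse(ip)
        if not host.endswith(tuple(allowed_suffixes)):
            return False
        return ip in forward(host)
    except OSError:  # a failed lookup counts as unauthenticated
        return False
```

Usage would be something like is_genuine_bot(remote_ip, [".googlebot.com"]) for any request whose UA claims to be Googlebot. The two lookup functions are passed in as parameters so the check can be tried out with canned answers instead of live DNS.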

This raises an interesting question: If you do authenticate SE bot requests, what do you do with unauthenticated ones?

Personally, I just block all unauthenticated bots. The request is served with a blank page without any content. I’ve found that this helped stop *all* (yes all) unauthenticated bots but with a proportional rise in stealthier bots (i.e. scrapers pretending to be a browser). No matter, this is an arms race and I’m in it for the long run.

Other people suggest you should feed unauthenticated requests with content that AdSense frowns upon like guns or porn. The idea is that these crawlers are out to get your content for MFA sites and so it’s best to get them banned the quick and dirty way.

Others suggest just ignoring them; after all, they’ll come back with a different UA anyway, so what’s the point? This attitude bothers me because it just means giving up and letting your content get scraped far and wide without any control.

So what do you do with unauthenticated bots and more generally, what do you do with bots?

Stop Competitors from Stalking Your Website Using AdWords


Regular readers will know that I like to gaze at my log files in search of life-changing inspirational moments. Well I have another such gem of an inspiration for you: figuring out if someone is stalking your website using the Google AdWords keyword tool and how to stop them.

When someone goes to the AdWords keyword tool and asks for keywords based on the contents of a web page (the "Website content" option), Google actually requests the page live. This request shows up in the logs and can of course be blocked. The details are:

Referred from: (No referer.)
Remote: 74.125.16.37
Request: HTTP/1.1 GET
UA: Mozilla/5.0 (compatible; Google Keyword Tool; +https://adwords.google.com/select/KeywordToolExternal)

So what to do? Be careful about blocking the IP addresses: a blanket block risks stopping legitimate requests from Google IP addresses (Googlebot, Google’s Feedfetcher, etc). However, the user agent is a good tell-tale sign and is ripe for blocking.

So: aim… fire!

Fire what though? A simple block? Nah, not much fun that. Knowing full well that only competitors will use that service to check out which keywords your pages might rank for, I would feed the requests dud content. Lorem ipsum anyone? How about random content about keyword theft? Here is an SEO exercise for you: which keywords can you get the Adwords keyword tool to show about your pages? To rephrase: what keywords can you "rank" for in the tool?
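A minimal sketch of the dud-content idea in Python, keyed on the user agent from the log excerpt above. The function names and the decoy text are made up for illustration; wire the check into whatever serves your pages.

```python
# Substring taken from the UA string logged above.
KEYWORD_TOOL_MARKER = "Google Keyword Tool"

# Illustrative decoy; swap in whatever dud content amuses you.
DECOY_PAGE = "<html><body><p>Lorem ipsum dolor sit amet.</p></body></html>"


def is_keyword_tool(user_agent):
    """True when the UA identifies the AdWords keyword tool fetcher."""
    return KEYWORD_TOOL_MARKER in (user_agent or "")


def body_for(user_agent, real_page):
    """Serve the decoy to the keyword tool and the real page to everyone else."""
    return DECOY_PAGE if is_keyword_tool(user_agent) else real_page


ua = ("Mozilla/5.0 (compatible; Google Keyword Tool; "
      "+https://adwords.google.com/select/KeywordToolExternal)")
print(is_keyword_tool(ua))
```

Matching on the UA rather than the IP address keeps Googlebot and Feedfetcher requests untouched.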

And don’t forget to go back into your logs and see how many times people have stalked you.

New Word for Spam: Linkosphere

Yes, that’s right folks. Step right up. We have a new buzzword to hide the fact that we’re scraping content and sending trackbacks to the original content. The new word is… Linkosphere.

So, pray do tell us Pierre, where would you come up with such a silly name? Why I’m glad you asked. It’s the service that’s been spamming me blog for the past few months, hosted at the one and only ectio dot us. See, them scrapers have a serious claim: "Find something to read, guaranteed!" I believe them given all the scraping they’re doing.

And thus because I am in the mood to return them the favo(u)r, I hereby declare them the prototypical scraposphere service. Beat that!

What is YahooCacheSystem?

I just started noticing some hits coming from a few *.yahoo.net IP addresses with a user agent of just "YahooCacheSystem" and requesting only the raw RSS XML feed so far. All requests are HTTP/1.0 GET, setting the HTTP_ACCEPT to */*. No other headers are set.

The first hit I’ve seen was on April 27th, which came from the IP address 216.39.58.78. Back then, that resolved to htproxy3.ops.re4.yahoo.net. However, ever since, the hits are all from a different C-block, 209.131.41.*, which resolves variously to htproxyX.ops.sp1.yahoo.net (X is a number like 1 or 2 to give htproxy1.ops.sp1.yahoo.net or htproxy2.ops.sp1.yahoo.net). Even more recently, the IP addresses remained the same, but the hosts they resolve to changed to htproxyX.ops.re4.yahoo.net (again, X is a number to give htproxy1.ops.re4.yahoo.net or htproxy2.ops.re4.yahoo.net).
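To pick these hits out of a log, a pattern like the following should cover every host name seen so far. It is a sketch based only on the names listed above; Yahoo! may well use other hosts I have not seen, and the function name is mine.

```python
import re

# Matches htproxyX.ops.sp1.yahoo.net and htproxyX.ops.re4.yahoo.net, X a number.
HTPROXY_HOST = re.compile(r"^htproxy(\d+)\.ops\.(sp1|re4)\.yahoo\.net$")


def is_yahoo_cache_system(user_agent, host):
    """True for the UA and host name combination described above."""
    return user_agent == "YahooCacheSystem" and HTPROXY_HOST.match(host) is not None


for host in ("htproxy3.ops.re4.yahoo.net", "htproxy1.ops.sp1.yahoo.net"):
    print(host, is_yahoo_cache_system("YahooCacheSystem", host))
```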

I post about this bot for one simple reason: the UA is very intriguing and the fact that it’s requesting just RSS XML feeds is also interesting. Are we going to see a Yahoo! service or a set of services that deal with just blogs?

TechCrunch reported way back in 2005 about the launch of Yahoo! Blog Search, which back then and today has pointed to what Yahoo! calls the News Search, which according to the web page is to "Search real-time news stories from Yahoo! News and across the web." That’s fine and dandy, but it’s no blog search per se.

So the YahooCacheSystem bot could represent one of two things:

  • Yahoo! is consolidating its backend infrastructure to deal with RSS-based sites better. So they are building a centralized RSS cache for all their services to use. For webmasters, this means we now have a new analytics data point we can look at.
  • Or… (wait, I need to peer at my crystal ball…) Yahoo! is moving towards building a serious set of services centred around XML feeds. This could mean we could see a true blog search product soon, or something else we can only guess at.

So which one is it? I can only provide guesses. Given the utter lack of evidence and, more importantly, rumors, I’m leaning towards the infrastructure explanation. However, a good infrastructure is necessary for a major strategic shift or product launch. Time will tell.

Live.com Spambot Ignores robots.txt

Oh, MSNbot, when will you ever learn? I won’t rehash the story that led me to blocking MSN’s referral-spamming bot, and that seems to have worked a bit. The problem is that the referral spam is still coming in! Yes, MSNbot is blocked but the spammy hits are still coming in.

Case in point, this hit from today over at Social Alerter:

/tips/how-not-get-dugg
  • At: 19 April 2008 11:04:39 AM GMT
  • Referred from: http://search.live.com/results.aspx?q=alerts&mrt=en-us&FORM=LIVSOP
  • Remote: livebot-65-55-165-107.search.live.com (65.55.165.107)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Enconding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:
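The giveaway in a hit like this is that the "visitor" arrives from a search.live.com results page while the request itself comes from a livebot host at search.live.com. A small Python sketch of that check, using only the fields shown in the log above (the function name is mine):

```python
from urllib.parse import urlparse


def is_live_referral_spam(remote_host, referer):
    """True when a live.com search referral is sent by a live.com bot host."""
    from_live_bot = remote_host.endswith(".search.live.com")
    claims_live_search = urlparse(referer).hostname == "search.live.com"
    return from_live_bot and claims_live_search


print(is_live_referral_spam(
    "livebot-65-55-165-107.search.live.com",
    "http://search.live.com/results.aspx?q=alerts&mrt=en-us&FORM=LIVSOP",
))
```

A real person clicking through from a live.com results page would show up with their own ISP’s host name, not a livebot one.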

Is it just me or is this beyond comical now?

Killing Live.com Bot


I’ve had it. The live.com spambot, aka msnbot, is officially not welcome either here or at Social Alerter. Why? The bot is still referral spamming. How much? 100% of my live.com referrals at Social Alerter are actually the bot’s spam. Granted the absolute number of hits is only in the low tens, but it is not right and such behavior is no longer welcome. And no, the constant lies that this behavior has stopped do not help.

For a background on this, start here, then read this post, and close off with the follow up.

Bye, bye. I hope to see you never.
