If you're part of the SEO industry, unless you've been living under a rock for the past couple of days, you will know that SEOmoz launched a new tool called Linkscape, to much fanfare. First things first, congrats and kudos are due to the SEOmoz team for building such a complex beast. It's not easy, at the very least on the technical level.
But there is a problem: SEOmoz has not disclosed the user agent (UA) of its crawler. Here I will talk about why this is a bad thing, and also go out on a limb and say: there is no SEOmoz crawler, at least not in the traditional sense. For the latter, I will offer a viable technical alternative, which may or may not be correct, but the fact that the alternative exists gives a sensible explanation as to why SEOmoz is not offering a straight answer to the UA question.
Why Disclosing the UA is Essential
Let's not mince words: we as an SEO community like a little mud fight once in a while. We debate and discuss and, yes, fight. But one thing we all know how to do is recognize malicious activity and tell it apart from merely aggressive activity.
Example: a bot scraping our content for an MFA site is a tolerated nuisance. We take steps to negate the effects of scrapers but at the end of the day we don't fight them hard. On the other hand, a bot probing for security holes is treated like a witch in 1209 AD.
Which is why Linkscape's lack of disclosure hurts: we as a community work hard at identifying bots. SEOmoz is supposed to be a good citizen of the SEO world, and yet the lack of transparency goes against the spirit and the image of SEOmoz. On the one hand we have a company with a strong community doing good deeds (SEO trademark fight anyone?) and on the other it behaves in a way we expect from the shady side of the net we deal with every day.
Not just that: the data collected from us, about us, will be used against us. It's called competitive intelligence.
And not just that: SEOmoz is using the data to make money. The free version is pathetic and the Pro version needs a monthly subscription.
To me, this kind of behavior (stealth, harmful, and to make money) puts Linkscape squarely in the naughty corner. I certainly didn't expect this out of SEOmoz. Tough luck Rand and co: you have a great brand and I for one expect better!
But I won't ask for a UA because I think there isn't one.
How To Build Linkscape
It's actually quite easy on a conceptual level. However, just like cooking, having a recipe doesn't make you a great chef - there are lots of details that SEOmoz must have tackled successfully to build Linkscape. I am not trying to belittle their achievement, and all I can show you is one recipe. This recipe is completely my guess and could very well be wrong. I have not talked to anyone at SEOmoz.
So come on Pierre, what is it? The answer is the Yahoo! Search API. It's an API giving programmers complete access to the Yahoo! index without crawling a single page. For example, the following URL:
http://search.yahooapis.com/WebSearchService/V1/webSearch?appid=YahooDemo&query=site%3Aseomoz.org%2F&results=2
fetches the first two hits from a Yahoo! [site:seomoz.org] query. Interestingly, it tells you where the cache URLs are, and they reside on Yahoo! servers (unsurprisingly). So you fetch the cache from Yahoo!, do the analysis, save what you care about (links, titles, etc.), and you're done.
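To make that recipe a bit more concrete, here is a minimal Python sketch of the two ends of it: building the query URL and pulling the title and links out of a page once you have its HTML. The endpoint and the YahooDemo appid come straight from the URL above; the function and class names are my own inventions, and I deliberately leave out the actual fetching and the parsing of the API response. This is a guess at the shape of the thing, not SEOmoz's code.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Endpoint copied from the example URL in this post.
API_ENDPOINT = "http://search.yahooapis.com/WebSearchService/V1/webSearch"


def build_site_query_url(domain, results=2, appid="YahooDemo"):
    """Return the API URL for a [site:domain/] query (hypothetical helper)."""
    params = {"appid": appid, "query": f"site:{domain}/", "results": results}
    # urlencode percent-encodes ':' and '/' in the query value.
    return f"{API_ENDPOINT}?{urlencode(params)}"


class LinkAndTitleParser(HTMLParser):
    """Collect the <title> text and every <a href> value from one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href, e.g. <a name="x">
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def analyze_cached_page(html):
    """Return (title, links) for a page's HTML, e.g. a cached copy."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.title.strip(), parser.links
```

Calling build_site_query_url("seomoz.org") reproduces the example URL above, and analyze_cached_page() returns the links exactly as they appear in the page, so relative links would still need resolving against the page's own URL.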
You'll need to kick-start this somehow with a seed set of sites. DMOZ and Wikipedia are usually good sources that are freely available. Wikipedia can even be downloaded so no one needs to know. Yahoo!'s very own Delicious, Digg, reddit, etc. are also good starting points because they tell you what's hot right now. The seed is basically a huge set of URLs from which you extract the domain names and do [site:domain] queries. Lather, rinse, repeat.
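The lather-rinse-repeat loop could look roughly like the sketch below. It assumes a hypothetical discover_links(domain) callback that returns the outbound URLs found for a domain (in the recipe above, that would come from analyzing the cached pages). For simplicity it treats each hostname as its own "domain", so www.seomoz.org and seomoz.org count separately, and max_domains is just an arbitrary safety cap I added.

```python
from collections import deque
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased hostname of a URL, or None if it has none."""
    return urlsplit(url).hostname


def crawl_frontier(seed_urls, discover_links, max_domains=100):
    """Return the [site:domain] queries to run, in breadth-first order.

    discover_links is a hypothetical callback: given a domain, it returns
    the URLs found on that domain's pages.
    """
    seen = set()
    queue = deque()
    queries = []

    def enqueue(url):
        domain = domain_of(url)
        if domain and domain not in seen:
            seen.add(domain)
            queue.append(domain)

    # Seed the frontier with the domains of the starting URLs.
    for url in seed_urls:
        enqueue(url)

    # Query each domain once; every newly discovered domain joins the queue.
    while queue and len(queries) < max_domains:
        domain = queue.popleft()
        queries.append(f"site:{domain}")
        for link in discover_links(domain):
            enqueue(link)

    return queries
```

With a couple of DMOZ and Wikipedia URLs as seeds and a discover_links that reports a link from dmoz.org to www.seomoz.org, the returned list would contain site:dmoz.org, site:en.wikipedia.org and then site:www.seomoz.org.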
Notice that you won't need to crawl a single page yourself. You let Yahoo! do the work for you. Neat, no?
So What Should SEOmoz Disclose?
Above I said two potentially conflicting things: that SEOmoz should disclose the Linkscape user agent, and that Linkscape doesn't need to have a user agent at all. So what exactly am I asking of SEOmoz?
Easy: complete disclosure. If SEOmoz is using a traditional crawler, we must have its UA and the IP addresses. It's only a matter of time before we find them. If not, SEOmoz needs to explain clearly why not.