The SEOmoz Linkscape Ghost
If you're part of the SEO industry, unless you've been living under a rock for the past couple of days, you will know that SEOmoz launched a new tool called Linkscape, to much fanfare. First things first: congrats and kudos are due to the SEOmoz team for building such a complex beast. That is no easy feat, at the very least on the technical level.
But there is a problem: SEOmoz has not disclosed the user agent (UA) of its crawler. Here I will talk about why this is a bad thing, and also go out on a limb and say: there is no SEOmoz crawler, at least not in the traditional sense. For the latter, I will offer a viable technical alternative, which may or may not be correct, but the fact that the alternative exists gives a sensible explanation as to why SEOmoz is not offering a straight answer to the UA question.
Why Disclosing the UA is Essential
Let's not mince words: we as an SEO community like a little mud fight once in a while. We debate and discuss and, yes, fight. But one thing we all know how to do is recognize malicious activity and tell it apart from merely aggressive activity.
Example: a bot scraping our content for an MFA site is a tolerated nuisance. We take steps to negate the effects of scrapers, but at the end of the day we don't fight them hard. On the other hand, a bot probing for security holes is treated like a witch in 1209 AD.
Which is why Linkscape's lack of disclosure hurts: we as a community work hard at identifying bots. SEOmoz is supposed to be a good citizen of the SEO world, and yet the lack of transparency goes against the spirit and the image of SEOmoz. On the one hand we have a company with a strong community doing good deeds (SEO trademark fight, anyone?), and on the other it behaves in a way we expect from the shady side of the net we deal with every day.
Not just that: the data collected from us, about us, will be used against us. It's called competitive intelligence.
And not just that: SEOmoz is using the data to make money. The free version is pathetic and the Pro version needs a monthly subscription.
To me, this kind of behavior (stealth, harmful, and to make money) puts Linkscape squarely in the naughty corner. I certainly didn't expect this out of SEOmoz. Tough luck Rand and co: you have a great brand and I for one expect better!
But I won't ask for a UA because I think there isn't one.
How To Build Linkscape
It's actually quite easy on a conceptual level. However, just like cooking, having a recipe doesn't make you a great chef - there are lots of details that SEOmoz must have tackled successfully to build Linkscape. I am not trying to belittle their achievement, and all I can show you is one recipe. This recipe is completely my guess and could very well be wrong. I have not talked to anyone at SEOmoz.
So come on Pierre, what is it? The answer is the Yahoo! Search API. It's an API giving programmers complete access to the Yahoo! index without crawling a single page. For example, the following URL:
fetches the first two hits from a Yahoo! [site:seomoz.org] query. Interestingly, it tells you where the cache URLs are, and they reside on Yahoo! servers (unsurprisingly). So you fetch the cache from Yahoo!, do the analysis, save what you care about (links, titles, etc.), and you're done.
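To make the "do the analysis" step concrete, here is a minimal sketch of pulling the title and the links out of one cached page. It is only my illustration of the idea, not anything SEOmoz is known to run: the class name PageSummary, the function summarize, and the sample HTML are all made up for the example, and recording whether a link carries rel="nofollow" is an extra I added on top of the recipe. It uses nothing but Python's standard html.parser module and does not touch the network.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the <title> text and every <a href> from one HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []  # list of (href, is_nofollow) tuples
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            # rel may hold several space-separated tokens, e.g. "external nofollow"
            rel = (attrs.get("rel") or "").lower().split()
            self.links.append((attrs["href"], "nofollow" in rel))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Hypothetical helper: return (title, links) for one page of HTML."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Made-up sample standing in for a cached copy fetched elsewhere.
sample = """<html><head><title> Example page </title></head>
<body><a href="http://example.com/a">A</a>
<a rel="nofollow" href="http://example.org/b">B</a></body></html>"""

title, links = summarize(sample)
```

A real pipeline would also have to resolve relative URLs and strip whatever banner the cache wraps around the page, which this sketch ignores.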
You'll need to kick-start this somehow with a seed set of sites. DMOZ and Wikipedia are usually good sources that are freely available. Wikipedia can even be downloaded, so no one needs to know. Yahoo!'s very own Delicious, Digg, reddit, etc. are also good starting points because they tell you what's hot right now. The seed is basically a huge set of URLs from which you extract the domain names and do [site:domain] queries. Lather, rinse, repeat.
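The seed step is simple enough to sketch too. The snippet below turns a list of seed URLs into one [site:domain] query per distinct host. The function name site_queries and the seed list are hypothetical, and I am making two simplifying assumptions: a leading "www." is dropped, and the host name is used as-is rather than being reduced to a registered domain.

```python
from urllib.parse import urlsplit


def site_queries(seed_urls):
    """Return one site: query per distinct host, in first-seen order."""
    seen = set()
    queries = []
    for url in seed_urls:
        # hostname is None for strings that are not absolute URLs
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # assumption: treat www.example.com as example.com
        if host and host not in seen:
            seen.add(host)
            queries.append("site:" + host)
    return queries


# Made-up seed list; in the recipe this would come from DMOZ, Wikipedia, etc.
seeds = [
    "http://www.seomoz.org/blog",
    "http://seomoz.org/tools",
    "http://en.wikipedia.org/wiki/Web_crawler",
    "not a url",
]
queries = site_queries(seeds)
```

Each query string would then be sent to the search API, and the links found in the results would feed new domains back into the seed set.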
Notice that you won't need to crawl a single page yourself. You let Yahoo! do the work for you. Neat, no?
So What Should SEOmoz Disclose?
Above I said two potentially conflicting things: that SEOmoz should disclose the Linkscape user agent, and that Linkscape doesn't need to have a user agent at all. So what exactly am I asking from SEOmoz?
Easy: complete disclosure. If SEOmoz is using a traditional crawler, we must have its UA and IP addresses. It's only a matter of time before we find them anyway. If not, SEOmoz needs to explain clearly why not.

October 8th, 2008 at 1:47 am
I asked if they were doing this, and SEOmoz said that this was potentially unreliable. Besides, it wouldn’t give you nofollow data etc., so you’d still have to find that out yourself.
October 8th, 2008 at 2:13 am
Well, that would explain why it took ‘em a year and only a few guys.
October 8th, 2008 at 4:03 am
Hey Pierre - great to hear from you. Sorry you’re not a fan of the tool. Personally, I love it - I think it’s the best thing we’ve ever produced, and something everyone in SEO deserves to have - the ability to see the web’s links the same way the search engines do.
I’m on the road at the moment, and have literally no time (expo and conference all day, meetings, dinners, etc. all night - and my new bride is here with me, so I can’t just sneak away all night and be online). My email is overloaded and I know I should be responding to things and certainly want to.
In any case - we are going to offer a way to block data from appearing in Linkscape ASAP - just need a chance to catch my breath, sit down with the dev team (who’s also crazy busy trying to support the product and handle the launch) and work out a plan.
Thanks for keeping watch on us - we’ll do our best not to disappoint.
October 8th, 2008 at 6:57 am
*** Wikipedia can even be downloaded ***
So can ODP. Check out their RDF file.
Interesting guess on the Yahoo API usage, but very unlikely. The data returned from SEOmoz is different to that from Yahoo, for sites tested so far.
October 8th, 2008 at 7:41 am
Thanks for commenting everyone.
@Gab: The cache will contain the rel="nofollow" code, so your parser will need to check for it. It’s not unreliable at least on this front.
@Rand: I didn’t say I’m not a fan of the tool - I’m really worried about how it is being positioned in the community. Read my post again and see how much I admire the technical skills (at the very least!) that went into making Linkscape. Very rarely does a tool make me go “wow” and Linkscape did.
I just want better community interaction. Google gave us Matt Cutts and John Mueller because they understood that their index has a big effect on lots of people.
@g1smd: I know of the WMW discussion thread featuring wowrack and DotBot. The DotBot index could be the seed: its size according to their website is only a few million pages, not 30+ billion. I’m sure there are other indices, so the data could be a hybrid.
And you’re right about ODP. I knew that but in my haste last night to get this up I seem to have forgotten that.
Pierre
October 17th, 2008 at 6:05 pm
[...] information on everybody’s websites without anyone noticing what they were doing. There was quite a bit of hoopla over the fact that when they announced their new index of 30 billion web pages (and the new tool [...]