Surge in Spam Bots

Anyone else noticing a huge spike in spam bots? My blocker is stopping record numbers of page requests from spam bots, about 10-15 times the usual rate. It started about 10 days ago. Thinking out loud: is this related to the holidays? Are spammers expecting webmasters to be sleeping so they set their bots to scrape now?

Anyone else seeing this?


Link to Me WordPress Plugin

I was browsing around Jim Westergren's website this morning and found a little gem: the Link to Me Textbox WordPress Plugin. Its task is simple enough: prod your visitors to link to the post and make it easier to do so at the same time. It displays a short text to the effect of 'if you like it please link', below which it shows a textarea with the link HTML all ready for copy/paste. Nice and simple, and it could be an integral part of any SEO link building campaign.

So now it's implemented here on things of sorts. I didn't like the default HTML (not just the wording) it produced, but thankfully, the plugin is very easy to edit. Once I tweaked it, I modified the CSS to merge it with the blog's theme.

Thanks, Jim! And if you don't know about Jim's website, have a look. It's a nice little gem on the web.


Everything You Want to Know About Writing SEO Tools - Part I

Over the past few months, I've written a few SEO tools. The first ones were, how should I say this, 'improvable', while some of the later ones are fairly popular now. This post is a digest of some of the things I've learned.

Get a good idea

We already have dozens of backlink analyzers, keyword analyzers, and so on, so what new thing will you add to this huge pile? There are so many tools out there, and so many people writing SEO tools, that it's hard to come up with something new. But it's still possible. I never actively think about an idea, but just keep an open mind while reading forums, blogs, articles, etc. Keep an eye out for 'wishful posts', things that say "wouldn't it be nice if we had..." or "I couldn't find anything that does..." or "all the ones I tried were bad". Those should seed an idea in your head.

Another source of ideas is your work. While I use other tools or try to do something, I sometimes find myself wishing to automate a task or going through a long process to get a new look at the data. Those are tool ideas just waiting to be implemented! If you work in a company, ask your colleagues. For the Domain Geolocator, the idea evolved in a forum thread that went on to refine it collectively: what do we want? What do users want? What is missing from the current tools you are using? Remember the old adage about giving users what they want?

Write a plan and the feature specifications

This is probably the most important step, especially for complex or large SEO apps. The feature specs let you prioritize development, see which features can interact, think of other features to add, decide which features should be removed, and more. They also give you a non-moving target to aim for.

The plan is slightly different: it's how the tool works. You draw boxes and lines and circles describing where data is stored, how it's retrieved, what database tables are needed, any settings, cookies, etc. Clearly, this overlaps a lot with the feature specs.

Know a programming language inside-out

Choose one or two languages and learn them inside-out, left, right, and center. The more you know how to program using a language, the easier development becomes. Write programs just for the sake of writing a program. Try to answer technical questions: which construct is faster? Is a database faster than a well-written text-file storage? Try to work with large data. Have you ever parsed text files tens of megabytes in size? Geeky? Sure, but you'll learn a lot.

For me, the language that I know well is PHP. I don't claim to know it inside out, but I do experiment and play with it for the sheer value of learning.

And yes, this takes a lot of time and experience.

Know HTML & CSS well

This might surprise you, but HTML/CSS is very important because it's your tool's interface: how it takes input and prints output. The more well-versed you are with HTML and CSS, the better the results. So many tools out there have crappy interfaces, and your tool's interface might make it a winner. Keep in mind the usability and functionality of your tool.

This is it for Part I. Keep an eye out for Part II soon with an in-depth technical discussion! Comments and questions below are very welcome :)

Extracting Data from Text Databases

With AOL (not-) releasing 2GB of data in text files, it is a good time to brush up or learn how to extract data quickly from text files. IBM has a new tutorial about command-line tools you can use to manipulate text files.

Although the tutorial talks about this as a Linux thing, it's not really. You can have most of the tools under Windows using the GNU utilities for Win32 collection. If you really want Linux anyway, you can run it within Windows using VMware Player (free) and downloading an operating system image for it. I use Ubuntu and PC-BSD, but you can choose from quite a few operating systems. This will set up a full Unix environment inside Windows... cool! You'll need ample RAM and a good processor.
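If you would rather script the extraction than chain command-line tools, here is a minimal Python sketch of the same idea: stream a large text file line by line and count the most frequent values in one column. It assumes a tab-delimited file; the function name `top_values` and the column layout are made up for illustration and are not taken from the tutorial or from any particular data set.

```python
from collections import Counter

def top_values(lines, column, n=10, delimiter="\t"):
    """Count the most common values in one column of a delimited text stream.

    `lines` can be an open file object, so the file is read one line at a
    time and never loaded into memory in full.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) > column:  # skip short or malformed rows
            counts[fields[column]] += 1
    return counts.most_common(n)
```

For a real file you would pass an open file handle, e.g. `with open(path, encoding="utf-8") as f: top_values(f, 1)`, where `path` is whatever file you are exploring.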


Last Major Web Rendering Engine Coming to Windows

It took a while, but WebKit, the rendering engine of Apple's Safari, is coming to Windows thanks to the new GetWebKit project. WebKit is open source, as it is based on (and feeds code back into) the KHTML project (see Wikipedia for more info).

This is the final major rendering engine to make it to Windows, making web design testing a whole lot easier. It's still in alpha (i.e., not yet stable enough to be called beta), so be careful if you download it! Soon though...


Digg Publishes API

I can't believe I missed this! Digg has just published its API specs. Nothing unexpected features-wise, but still lacking many of the features of the del.icio.us API. Come on Digg!


The Behavior of the Yahoo! Bots - Part II

In The Behavior of the Yahoo! Bots - Part I, I talked about an in-depth analysis of visits from Yahoo! domains to eKstreme.com in April. That first post talked about the user agents used. This post will discuss how the bots accessed the pages in more detail.

Yahoo! Slurp

The first thing I looked at is which remote hosts Slurp used. There was a simple pattern: there were three 'domain series' that requested pages while identifying themselves as Yahoo! Slurp:

urlc*.mail.mud.yahoo.com

cdev*.yst.corp.yahoo.com

echtest*.yst.corp.yahoo.com

In all cases, the * was a number. I could not see a pattern that would assign a function to each series. Remember that these are not where the majority of Slurp hits came from: tens of thousands of requests came from *.inktomisearch.com, which seems to be the main Slurp domain.
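For anyone who wants to run the same breakdown on their own logs, here is a rough Python sketch that sorts remote hosts into the series listed above. The host patterns come from this post; the series labels, the function names, and the example hostnames in the usage note are made up for illustration.

```python
import re
from collections import Counter

# Patterns for the host series described in the post.
# The dictionary keys are just labels chosen for this sketch.
SERIES = {
    "urlc": re.compile(r"^urlc\d+\.mail\.mud\.yahoo\.com$"),
    "cdev": re.compile(r"^cdev\d+\.yst\.corp\.yahoo\.com$"),
    "echtest": re.compile(r"^echtest\d+\.yst\.corp\.yahoo\.com$"),
    "inktomisearch": re.compile(r"\.inktomisearch\.com$"),
}

def classify_host(host):
    """Return the label of the first series the host matches, or 'other'."""
    host = host.lower()
    for name, pattern in SERIES.items():
        if pattern.search(host):
            return name
    return "other"

def count_series(hosts):
    """Tally how many hits came from each series."""
    return Counter(classify_host(h) for h in hosts)
```

Calling `count_series(["urlc3.mail.mud.yahoo.com", "crawler1.inktomisearch.com"])` would count one hit for each of those two series (both hostnames are invented examples).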

Slurp also has an annoying behavior: I found many examples where Slurp would request the same page multiple times (2-7) within a 5-10 second time span. I even found examples of multiple requests per second. I wonder why Slurp does that. Is Yahoo! trying to gauge how good a site is under stress? Is it a buggy Slurp? I don't know.
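If you want to look for this kind of repeated-request burst in your own logs, a sliding window over the timestamps of each URL is enough. The sketch below is one way to do it in Python, assuming you have already reduced the log to (Unix timestamp, URL) pairs; the function name `find_bursts` and its parameters are hypothetical.

```python
from collections import defaultdict

def find_bursts(hits, window=10, min_hits=2):
    """Find URLs requested at least `min_hits` times within `window` seconds.

    `hits` is an iterable of (unix_timestamp, url) pairs.
    Returns {url: largest number of requests seen inside any one window}.
    """
    by_url = defaultdict(list)
    for ts, url in hits:
        by_url[url].append(ts)

    bursts = {}
    for url, times in by_url.items():
        times.sort()
        start = 0
        best = 0
        for end in range(len(times)):
            # Shrink the window from the left until it spans <= `window` seconds.
            while times[end] - times[start] > window:
                start += 1
            best = max(best, end - start + 1)
        if best >= min_hits:
            bursts[url] = best
    return bursts
```

Setting `window=1` would catch the multiple-requests-per-second cases.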

Slurpy Verifier

This one is new to me. It looks like a bot: no referring links, the same IP address for each request (rdev25.yst.corp.yahoo.com), and requests for only a single page of this blog. Others have observed identical behavior (see yeraze.com, The Hive Archive, and Ham Radio Blog). It hasn't done anything particularly annoying (or useful, for that matter) so far, but we should keep an eye on it for future reference.

Scooter

Scooter was Altavista's bot, and it seems it's still alive. It requested pages four times, three of which were for the guestbook. I wonder if it's scraping links or emails ;).

One notable feature of Scooter is that it sets the HTTP_REFERER header to be equal to the page it's requesting. I'm seeing more bots send a referring link header lately, so maybe this is a trend that's emerging.
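Spotting this self-referring behavior in a log is a simple comparison between the requested URL and the referer. Here is a small Python sketch; the function names are made up, and the normalization (ignoring case in the scheme and host, and ignoring any fragment) is my own assumption about what should count as "the same page".

```python
from urllib.parse import urlsplit

def _normalize(url):
    """Reduce a URL to the parts that matter for comparison (no fragment)."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query)

def is_self_referer(requested_url, referer):
    """True if the referer header points at the very page being requested."""
    if not referer:
        return False
    return _normalize(requested_url) == _normalize(referer)
```

Both arguments are expected to be absolute URLs, so a log that stores only the request path would need the site's scheme and host prepended first.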

Conclusions

This pretty much sums it up. The Yahoo! bots are quite active and diverse. Given the recent Yahoo! index updates (after months of stagnation), it's refreshing to see Yahoo! take such an interest in websites. I actually decided to look at the Yahoo! bots since Yahoo! started sending me significant traffic; only a few months ago, Yahoo! couldn't have cared less about eKstreme.com.

And in closing: if anyone has any more info about any of those bots, please drop me a line.


The Behavior of the Yahoo! Bots - Part I

I decided to look at how Yahoo! crawled my site in April. I've found some very interesting things. Some findings were new to me, but some were old news.

Methodology

Using the PHPCounter website analytics program, I searched for all hits in April whose remote host contained the word 'yahoo'. A total of 378 hits were found. Note that this will not detect all hits from Yahoo; indeed, the number of visits by the Yahoo! Slurp bot was much, much higher.
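The filter itself is nothing more than a case-insensitive substring match on the remote host. As a sketch in Python (this is not how PHPCounter does it internally; the `remote_host` key and the function name are assumptions for illustration):

```python
def hits_from(hits, needle="yahoo"):
    """Keep only the hits whose remote host contains `needle` (case-insensitive).

    `hits` is a list of dicts; each is assumed to carry a 'remote_host' key.
    """
    needle = needle.lower()
    return [h for h in hits if needle in h.get("remote_host", "").lower()]
```

A filter like this only sees hosts with 'yahoo' in the name, so any Yahoo! crawler that resolves to a differently named domain is left out of the count.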

The data came from the log files of eKstreme.com.

User Agents

I was surprised by the diversity of user agents that I found. The 'standard' Yahoo! Slurp user agent was the most active bot, and it was identified as:

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

No surprises there really. Yahoo! also have the Yahoo! Seeker bot, and indeed, I got a few hits from it. However, I found two user agents for it:

YahooSeeker/1.2 (compatible; Mozilla 4.0; MSIE 5.5; yahooseeker at yahoo-inc dot com ; http://help.yahoo.com/help/us/shop/merchant/)

DoCoMo/2.0/SO502i (compatible; Mozilla 4.0; MSIE 6.0; yahooseeker-jp-mobile AT Yahoo!JAPAN)

There was only a single hit from the second one, and it was from *.yahoo.co.jp. Localized mobile searching anyone?

Some of the more interesting visits from Yahoo domains had the following user agent:

Scooter/3.3

Scooter was the bot for AltaVista, which got bought by Overture, which Yahoo! acquired. It seems that Scooter is still alive, as evidenced by the occasional visits. Incidentally, Scooter sends the HTTP_REFERER header, and sets it to be exactly the page it is requesting; that is, it pretends the page is referring itself.

I also found a new (to me, at least) user agent:

Slurpy Verifier/1.0

It accessed only a single page from this blog (the .htaccess and WordPress entry) about 10 times. Why this particular interest, I have no idea.

One I expected to see was the Yahoo! Blogs bot. Indeed, I found a few hits from it, but some of the hits were to non-blog pages. The pattern seems to be that the non-blog pages it requested were linked to from blog pages. I am not sure how generally accurate this observation is, but it is the pattern for now. The user agent was:

Yahoo-Blogs/v3.9 (compatible; Mozilla 4.0; MSIE 5.5; http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html )

Finally we come to the biggest two surprises: Yahoo user agents that do not identify themselves as Yahoo! bots. The first one is:

Nokia6600/1.0 (4.09.1) SymbianOS/7.0s Series60/2.0 Profile/MIDP-2.0 Configuration/CLDC-1.0

That sounds like a mobile phone (cellular phone for our American friends :) ). Was it a bot or someone using a phone to connect to eKstreme.com via the Yahoo! network? I think it was a bot because:

  • None of the hits had a referring URL.
  • All hits were to the eKstreme.com root index.
  • They all came from an IP address that resolves to test02.wap.search.scd.yahoo.com. The 'test02' bit suggests a testing site.
  • A total of 7 hits were requested, all occurring in a 3-second span. (Talk about stressing a website!)

So what's up with that Yahoo!?
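The four clues listed above are easy to turn into an automated check. The Python sketch below applies them to a group of hits that share a user agent; the dictionary keys, the function name, and the 5-second threshold are all assumptions of mine, and passing every check is a hint rather than proof that the visitor is a bot.

```python
def bot_clues(hits, max_span=5):
    """Apply the four heuristics to a group of hits sharing one user agent.

    Each hit is a dict with hypothetical keys: 'time' (Unix timestamp),
    'path', 'referer', and 'remote_host'.
    """
    times = [h["time"] for h in hits]
    return {
        "no_referer": all(not h["referer"] for h in hits),
        "single_path": len({h["path"] for h in hits}) == 1,
        "single_host": len({h["remote_host"] for h in hits}) == 1,
        "tight_burst": max(times) - min(times) <= max_span,
    }

def looks_like_bot(hits, max_span=5):
    """True only if the group is non-empty and every heuristic agrees."""
    return bool(hits) and all(bot_clues(hits, max_span).values())
```

Returning the individual clues as well as the verdict makes it easier to see which heuristic failed for borderline visitors.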

The second big surprise was two identical hits. Both hits came from morgue2.corp.yahoo.com, and both had the following user agent:

Mozilla/4.05 [en]

This one is a bit too interesting for my taste: why would anyone call a server 'morgue', let alone have two of them (as in 'morgue2'), and why the very plain user agent? Too many questions.

Just for the sake of completeness, I want to note that I found quite a few 'normal' browser user agents, and they followed behavior expected from human surfers: they had real referring links, were clearly visits to several related pages (like using a tool and following through every page of the analysis), and had expected user agent strings.

In short, there seem to be quite a few Yahoo! bots around, which raises the simple question of 'why?' What does each bot do? Why does Yahoo! need them? Fighting spam might be one reason: Yahoo! may be sending diverse user agents in order to discover cloaking based on user agents. This is not that effective as clearly all hits were identifiable as coming from Yahoo!

On to The Behavior of the Yahoo! Bots - Part II


.htaccess and WordPress

When installing WordPress for this blog, I stumbled across this problem: I couldn't get 301 redirection to forward the www.ekstreme.com version to the ekstreme.com version of the site if accessing the /thingsofsorts/ directory. After some playing, this is how to do it.

The default .htaccess that comes with WordPress is:

RewriteEngine On
RewriteBase /thingsofsorts/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /thingsofsorts/index.php [L]

The simple modification is this:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.ekstreme\.com [nc]
RewriteRule (.*) http://ekstreme.com%{REQUEST_URI} [R=301,L]
RewriteBase /thingsofsorts/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /thingsofsorts/index.php [L]
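To make the logic of the two added lines explicit, here is a small Python model of what they do: if the Host header starts with www.ekstreme.com (in any letter case), answer with a 301 to the same request URI on ekstreme.com. This is only an illustration of the rule's logic, not something Apache runs, and the function name is made up.

```python
import re

# Same pattern as the RewriteCond; re.IGNORECASE plays the role of the [nc] flag.
WWW_HOST = re.compile(r"^www\.ekstreme\.com", re.IGNORECASE)

def redirect_target(host, request_uri):
    """Return the 301 target for a request, or None if no redirect applies."""
    if WWW_HOST.search(host):
        return "http://ekstreme.com" + request_uri
    return None
```

Requests that already use the bare ekstreme.com host return None here, which corresponds to Apache falling through to the standard WordPress rules below the redirect.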

Nice and simple. You can read more about 301 redirection and SEO in my 301 redirection tutorial; it even includes code examples!


Visual Studio Express free forever

Microsoft just announced that Visual Studio Express will be free permanently. Web developers get a special web development edition, and lots of tutorials. Anyone up for developing websites and desktop-based SEO tools? I sure am!

