CrawlerController: referral spam blocking script
Update: CrawlerController 1.1 has been released. It fixes a bug in remote host detection. Download CrawlerController 1.1.
Referral spamming: the problem
Referral spamming (also known as log spamming) is a dubious marketing technique in which web bots visit pages using spoofed HTTP_REFERER headers. A quick Google search for referral spamming shows that this is a well-documented problem that hits blogs especially hard.
Here is how it works: bots visit your site pretending to come from a particular advertising page. Some websites (particularly blogs) publicly display a 'latest 10 referrers' list (or similar). These links are part of the page, which gets indexed by search engines. To search engines, your site (along with many, many others) therefore appears to link to these fake referrers, which boosts the fake referrers' ranking in search engines.
If you want to track spammers on your website, use PHPCounter to get detailed web site statistics.
Referral spamming: the solution
To use CrawlerController to its full potential, please also read the 'CrawlerController in full technical detail' section below, but you don't have to!
How CrawlerController works to fight referral spamming
CrawlerController checks four different features of a page request: the referring link, the user agent, the remote IP address, and the remote host. The checks are for known spam signs, and you can modify them.
If CrawlerController detects a potential spam hit, it displays an error message asking the visitor to type in a word. This is called a captcha. You can change this behavior, as described below.
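To make the four checks more concrete, here is a minimal sketch of how a PHP script could collect those request features. It is not CrawlerController's actual code: the function name ccExampleRequestFeatures and its optional resolver parameter are hypothetical, and obtaining the remote host through a reverse DNS lookup with gethostbyaddr() is an assumption.

```php
<?php
// Hypothetical helper, not part of CrawlerController: gathers the four
// request features the script is described as checking.
function ccExampleRequestFeatures(array $server, ?callable $resolveHost = null): array
{
    $ip = $server['REMOTE_ADDR'] ?? '';

    // Assumption: the remote host comes from a reverse DNS lookup of the IP.
    // gethostbyaddr() returns the IP unchanged when no host name is found.
    $resolveHost = $resolveHost ?? 'gethostbyaddr';
    $host = $ip !== '' ? (string) $resolveHost($ip) : '';

    return [
        'referer'     => $server['HTTP_REFERER'] ?? '',    // may be absent or spoofed
        'user_agent'  => $server['HTTP_USER_AGENT'] ?? '', // may be absent or spoofed
        'remote_ip'   => $ip,
        'remote_host' => $host,
    ];
}
```

On a live page this would be called as ccExampleRequestFeatures($_SERVER). The resolver parameter only exists so the lookup can be replaced, for example when trying the function out without network access.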
Installing and using CrawlerController
Download and unzip the CrawlerController distribution zip file into its own directory on your server.
For each page you want CrawlerController to defend, add the following code at the very beginning of the page, before any other content...
<?php require "/full/path/to/crawler-controller-dist.php"; ?>
...changing the path so that it points to the CrawlerController script on your server. That's it!
If you have any problems, questions, or suggestions for new blocks, please email me.
Adding more blocks
To add more blocks, edit the following files:
- crawler-controller-ref.txt: One keyword or phrase per line to check against referring links. (More precisely, each line is one string you want to match.)
- crawler-controller-rem.txt: One remote IP address you want to block per line.
- crawler-controller-remhost.txt: One remote host (domain name) you want to block per line.
- crawler-controller-ua.txt: One keyword or phrase per line to check against user agents. (More precisely, each line is one string you want to match.)
All files are simple text files, and the entries are case-insensitive.
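The block files boil down to lists of strings that are matched against a request value. As a rough illustration (not the actual CrawlerController source), the sketch below loads one such file and performs a case-insensitive substring match. The function names are hypothetical, and substring matching is assumed from the "one string per line you want to match" description.

```php
<?php
// Hypothetical helpers, not taken from the CrawlerController source.

// Read a block file into an array of non-empty, trimmed entries.
// Returns an empty array when the file is missing or unreadable.
function ccExampleLoadBlockList(string $path): array
{
    $lines = is_readable($path)
        ? file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
        : false;
    if ($lines === false) {
        return [];
    }
    // Trim stray whitespace and drop entries that end up empty.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}

// Return the first entry found inside $value (case-insensitive), or null.
function ccExampleFirstMatch(string $value, array $entries): ?string
{
    foreach ($entries as $entry) {
        if (stripos($value, $entry) !== false) {
            return $entry;
        }
    }
    return null;
}
```

With a crawler-controller-ref.txt containing the line "diet", a referring link such as http://cheap-DIET-pills.example/ would match. For IP addresses an exact comparison may be a better fit than a substring match, since "1.2.3.4" is also a substring of "11.2.3.4".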
CrawlerController in full technical detail
If CrawlerController detects a potential spam hit, it raises the alarm by setting two PHP variables: the first one is called $CCRecommendsBlock, and it is set to TRUE; the second is $CCBlockReason, and it contains a short sentence about why CrawlerController thinks this is a hit. A typical reason is "HTTP_REFERER matches: diet HTTP_REFERER: ...". Thus by default, CrawlerController only informs you of a potential spam hit.
The two variables set by CrawlerController allow you to make a decision; options include:
- Simply check if $CCRecommendsBlock is TRUE and then block the request by using the exit() PHP function.
- If $CCRecommendsBlock is TRUE, output an HTTP status header, such as '403 Forbidden' or '404 Not Found', to tell the bot to go somewhere else.
- If $CCRecommendsBlock is TRUE, you can ask for a captcha to confirm whether the hit is from a bot or not. This is the option I am using on eKstreme.com.
Basic decision making code is bundled with CrawlerController, but it is recommended that you modify it for your website. To modify CrawlerController's behavior, edit the crawler-controller-dist.php file: at the very end of the file is the PHP code you need to edit.
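As a hedged illustration of the first two options, the sketch below shows what such decision code might look like. It assumes $CCRecommendsBlock and $CCBlockReason have already been set by the CrawlerController checks as described above. The function ccExampleDecide and its return format are hypothetical; this is not the code bundled in crawler-controller-dist.php.

```php
<?php
// Hypothetical decision helper, not the code bundled with CrawlerController.
// Maps the two variables described above to an HTTP status and a message.
function ccExampleDecide(bool $recommendsBlock, string $reason): array
{
    if (!$recommendsBlock) {
        return ['status' => 200, 'body' => ''];
    }
    // Escape the reason: it can contain attacker-controlled text
    // such as the referring link.
    $safe = htmlspecialchars($reason, ENT_QUOTES, 'UTF-8');
    return ['status' => 403, 'body' => 'Request blocked: ' . $safe];
}

// Usage, placed after the CrawlerController checks have run. The isset()
// guard only keeps this sketch from acting when the variables are absent.
if (isset($CCRecommendsBlock, $CCBlockReason)) {
    $decision = ccExampleDecide($CCRecommendsBlock === true, (string) $CCBlockReason);
    if ($decision['status'] !== 200) {
        http_response_code($decision['status']); // sends "403 Forbidden"
        echo $decision['body'];
        exit();
    }
}
```

Keeping the decision in a small function separates the policy (which status to send) from the side effects (sending the header and stopping the page), which makes it easier to swap in a different response later, such as a captcha.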
Referral spamming: further reading
If you are interested in reading more about the problem, these resources should get you started:

