PHP Search Engine Bot Authentication

How do you authenticate search engine bots? In the good old days, you simply checked the User-Agent header that every bot sends to your server when requesting a page (PHP exposes it as $_SERVER['HTTP_USER_AGENT']). However, this is open to abuse, as anyone can look up the user agent string of a search engine bot and send it themselves. You can try it yourself: if you're using Firefox, get the User Agent Switcher extension and have a play.
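To make the weakness concrete, here is a minimal sketch of a user-agent-only check. The function name claims_to_be_bot is made up for this illustration. The check only tells you what the visitor claims to be, so a browser sending a copied Googlebot string passes it.

```php
<?php
// Hypothetical helper (not from the original post): a user-agent-only check.
// It reports what the client CLAIMS to be, nothing more.
function claims_to_be_bot(string $ua): bool
{
    return stripos($ua, 'msnbot') !== false
        || stripos($ua, 'googlebot') !== false;
}

// A user agent string anyone can copy and send from an ordinary browser.
$spoofed = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

var_dump(claims_to_be_bot($spoofed)); // bool(true), whoever actually sent it
```

This is why the user agent is only good as a first filter, not as proof.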

So with this abuse, what's the alternative? DNS checking. The idea is simple: each page request comes from a specific IP address. Your script looks up the hostname associated with that IP address (a reverse DNS lookup), then resolves that hostname back to an IP address (a forward DNS lookup). This round trip should yield the same IP address as the original requesting IP address. If it doesn't, then you have a spammer on your hands - block them!
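The round trip on its own can be sketched roughly like this. The function name ip_round_trips and its two optional resolver parameters are assumptions added for illustration; they default to PHP's gethostbyaddr and gethostbyname, and exist only so the logic can be tried with stub resolvers instead of live DNS.

```php
<?php
// Hypothetical helper: reverse lookup followed by a forward lookup.
// $reverse and $forward are illustrative hooks; by default they are
// PHP's own gethostbyaddr() and gethostbyname().
function ip_round_trips(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbyname';

    // Reverse DNS: IP address -> hostname.
    $hostname = $reverse($ip);

    // gethostbyaddr() returns false on malformed input and the
    // unmodified IP address when the lookup fails.
    if (!is_string($hostname) || $hostname === $ip) {
        return false;
    }

    // Forward DNS: hostname -> IP address. gethostbyname() returns the
    // hostname unchanged on failure, so the comparison then fails too.
    return $forward($hostname) === $ip;
}
```

Note that gethostbyname only returns an IPv4 address, so this sketch does not cover visitors arriving over IPv6.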

PHP Code

So how do you do this on the server? It's very easy with PHP.

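  • Check the user agent to see if it's identifying itself as a search engine bot
  • If so, get the IP address requesting the page
  • Reverse DNS lookup the IP address to get a hostname
  • Check that the hostname belongs to the search engine's domain
  • Forward DNS lookup the hostname to get an IP address
  • Check that this IP address matches the one requesting the page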

The code:

$ua = $_SERVER['HTTP_USER_AGENT'] ?? ''; // the header may be missing entirely
if (stristr($ua, 'msnbot') || stristr($ua, 'googlebot')) {
    // It's claiming to be MSN's bot or Google's bot
    $ip = $_SERVER['REMOTE_ADDR'];
    $hostname = gethostbyaddr($ip); // reverse DNS lookup
    if (!preg_match("/\.googlebot\.com$/", $hostname) && !preg_match("/search\.live\.com$/", $hostname)) {
        // The hostname does not belong to either search.live.com or googlebot.com.
        // Remember the UA already said it is either MSNBot or Googlebot.
        // So it's a spammer.
        echo "Please leave";
    }
    else {
        // Now we have a hit that half-passes the check. One last go:
        $real_ip = gethostbyname($hostname); // forward DNS lookup
        if ($ip != $real_ip) {
            // Spammer!
            echo "Please leave";
        }
        else {
            // Real bot
            echo "Welcome!";
        }
    }
}

The functions used in the code (stristr, gethostbyaddr, preg_match and gethostbyname) are documented on php.net if you want to read more about them. Also, the comments tell you what's going on in the code. Notice that we do a case-insensitive check on the user agent using stristr to see if it claims to be MSNBot or Googlebot. If it does, then we do the DNS check, and check the results.

Two final comments on the preg_match checks. They simply check that the $hostname string actually ends with either search.live.com or googlebot.com. If not, then we caught a spammer. If the $hostname does indeed end in search.live.com or googlebot.com, then we either have the genuine article or someone who has pointed the reverse DNS record for their own IP address at one of those names. This last possibility takes us to the final check in the else block: the forward lookup goes through the search engine's own DNS records, which an impostor cannot change.
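To see what the two patterns accept and reject, here is a small sketch. The function name hostname_looks_like_bot and the sample hostnames are made up for illustration; the patterns are the same ones used in the code.

```php
<?php
// Hypothetical helper wrapping the two suffix patterns from the code.
function hostname_looks_like_bot(string $hostname): bool
{
    return preg_match('/\.googlebot\.com$/', $hostname) === 1
        || preg_match('/search\.live\.com$/', $hostname) === 1;
}

// Illustrative hostnames only.
var_dump(hostname_looks_like_bot('crawl-192-0-2-1.googlebot.com')); // bool(true)
var_dump(hostname_looks_like_bot('googlebot.com.example.net'));     // bool(false)
var_dump(hostname_looks_like_bot('fakegooglebot.com'));             // bool(false)
```

The $ anchor rejects hostnames that merely contain the name somewhere in the middle, and the leading \. in the Googlebot pattern rejects lookalike domains such as fakegooglebot.com.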

The other thing is that we're doing a negative check, that is, checking that preg_match does NOT match search.live.com or googlebot.com. We can implement the code with a positive check (i.e., checking that preg_match does match), but of course, the actual code logic changes a bit. Either way works :)
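For comparison, a positive-check version might look something like the sketch below. The function name is_verified_bot and its optional resolver parameters are assumptions for illustration (they default to gethostbyaddr and gethostbyname and are there only so the logic can be exercised without live DNS); this is one possible arrangement, not the only one.

```php
<?php
// Hypothetical function: the same checks written as positive tests.
// Returns true only when every step passes.
function is_verified_bot(string $ua, string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbyname';

    // Step 1: the user agent must claim to be MSNBot or Googlebot.
    if (stripos($ua, 'msnbot') === false && stripos($ua, 'googlebot') === false) {
        return false;
    }

    // Step 2: reverse DNS lookup (false means malformed input).
    $hostname = $reverse($ip);
    if (!is_string($hostname)) {
        return false;
    }

    // Step 3: positive check on the hostname suffix.
    if (preg_match('/\.googlebot\.com$/', $hostname) || preg_match('/search\.live\.com$/', $hostname)) {
        // Step 4: forward DNS lookup must lead back to the same IP address.
        return $forward($hostname) === $ip;
    }

    return false;
}
```

You would call it as is_verified_bot($_SERVER['HTTP_USER_AGENT'] ?? '', $_SERVER['REMOTE_ADDR']). Bear in mind that it also returns false for ordinary visitors who never claimed to be a bot, so check the user agent yourself before deciding to block anyone.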

That's it really!

Cloaking

Of course, this makes cloaking (especially of the black hat variety ;) ) very easy.
