A Little Bump

Is Matt McGee the last person on Twitter? Seems so.

So come on, Googs, help him out.

Live.com Spambot Ignores robots.txt

Oh, MSNbot, when will you ever learn? I won't rehash the story that led me to block MSN's referral-spamming bot; the block seems to have worked a bit. The problem is that the referral spam is still coming in! Yes, MSNbot is blocked, but the spammy hits keep arriving.

Case in point, this hit from today over at Social Alerter:

/tips/how-not-get-dugg
  • At: 19 April 2008 11:04:39 AM GMT
  • Referred from: http://search.live.com/results.aspx?q=alerts&mrt=en-us&FORM=LIVSOP
  • Remote: livebot-65-55-165-107.search.live.com (65.55.165.107)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Enconding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

Is it just me or is this beyond comical now?

I’ve Left AdSense Speechless

The screenshot below is from my AdSense account. It seems I have reached the pinnacle of optimization as no new optimization suggestions have been recommended since February.

AdSense screenshot

Is this a bug or account specific? Each of the reports I see is different.

The Real Strategy Behind Google App Engine

I just had an "OMG this will change the world!" kind of moment while playing for just 5 minutes with Google's App Engine. Let me explain.

A bit of background first: The Google App Engine is a newly launched service from Google that, for a change, seems to be well thought out. The service offers a Python-only environment (for now) to build applications locally and host them on Google's vast infrastructure. The idea here is that you don't have to worry about scaling your application to handle massive traffic; you let the App Engine running on Google's servers deal with it. The Engine comes with lots of goodies like handling database stuff, user logins (and what a boon that will be for Google accounts), and others. All in all, a nice comfy environment for rapid application development and reliable hosting.

But from all the buzz on the net, I think there is something missing that I just hinted at above:

to build applications locally and host them on Google's vast infrastructure

App Engine comes with its own development setup that runs off your computer (available for Windows, OSX, and Linux). You develop the application on your computer, run it, test it, add features, and then upload it to Google's computers. My question is this: What's stopping Google from turning the local development code into a full desktop-based runtime for web applications? Why keep it as a development-only environment?

Let's look at this from another angle: the desktop-webapp integration market. Adobe recently released their oddly-named AIR (Adobe Integrated Runtime). In the AIR-world, you can write applications in HTML/CSS/JS or Actionscript and package them into desktop applications that run within AIR or within the Flash player in the browser. The AIR environment is available for Windows and Macs, and Linux support is on the way. Brilliant move: one code base, both browser and desktop functionality.

Microsoft also has a similar play in the form of .Net, and more specifically Silverlight. The .Net runtime is available for many devices and platforms (mobile, desktop, and I think even the XBox). With Silverlight, Microsoft's play is to give developers a platform to use .Net in the browser; this is coming in Silverlight 2.0 this summer. So with this, again, one code base can be used on the web and on the desktop to give true multi-platform programming.

There are other entries in this market, Mozilla Prism being a prominent example. They all promise the same thing: one code base, many places to run it, with varying details.

Now back to App Engine and to the question I posed: imagine Google comes out with a desktop runtime/environment that turns App Engine webapps into desktop-based apps. This will be directly parallel to Adobe's AIR but with a big difference: the same code will also be easily deployable on a reliable and scalable infrastructure - Adobe doesn't have that.

There is another difference: because of the way App Engine works, you could easily imagine it talking to Google Apps like Google Docs etc. A desktop App Engine will bring Google's applications onto the desktop and open up a market-disrupting war: direct office productivity competition with Microsoft. To rephrase, App Engine could be Google's way to enter Microsoft's turf on the desktop.

So any evidence for this? Nothing solid, so it's all speculation, but I'll point to three hints:

  • The name. It's not App Server or App Service but App Engine. Google understands branding well enough (it's arguably the main source of their traffic) so their choice of words here is intriguing. And I can't help but think that Google's App Engine will drive some sort of Google Gears. Nudge, nudge, wink, wink.
  • When creating an application, you can specify that only users of a certain Google Apps domain can use the app. This integration with Google Apps is perhaps hinting at bigger things to come.
  • The APIs available in App Engine: already App Engine supports dealing with mail, and given the point above, you can imagine an API for the other Google Apps. This would enable a go for the desktop market.

What do you think? I think this is the best move out of Google yet and as disruptive as AdWords was.

Killing Live.com Bot

I've had it. The live.com spambot, aka msnbot, is officially not welcome either here or at Social Alerter. Why? The bot is still referral spamming. How much? 100% of my live.com referrals at Social Alerter are actually the bot's spam. Granted the absolute number of hits is only in the low tens, but it is not right and such behavior is no longer welcome. And no, the constant lies that this behavior has stopped do not help.

For a background on this, start here, then read this post, and close off with the follow up.

Bye, bye. I hope to see you never.

Houston, We Have a Twitter

That's right folks. Today I decided to actually do something about my Twitter account. Follow me at pierrefar.

The question is *what* will I do with the account? It may be a few days before I dive in properly :) See you @twitter

How to *REALLY* Deal with Hackers

Donna over at SEO Scoop asks an excellent question: more and more we're seeing website attacks for SEO purposes, not more malicious intents (like stealing credit card details). Donna asks, how should we deal with this kind of attack? I'm going to hazard some suggestions.

First things first. We're not dealing with hackers. Nosiree, we're dealing with crackers. A hacker is a well-seasoned coder. A cracker is a hacker who exploits security holes for nefarious purposes.

With semantics out of the way, here are some suggestions:

  • Googlebomb yourself: If you get attacked with, for example, the Slash One Wordpress exploit, essentially you're going to get a lot of spammy "content" pages and lots of links to them. So what happens if you use .htaccess or otherwise to redirect all requests to wp-content/1/* to, say, your site's home page? Or why not to your newly minted, specially created, [Texas holdem play online] site? Hey, you're probably going to get a lot of traffic, so use it! Here is the code:
    RewriteEngine On
    # Redirect anything under wp-content/1/ and stop processing further rules
    RewriteRule wp-content/1/ http://my-new-spammy-aff-site.com/ [R,L]
    Essentially, you'll googlebomb yourself with their links and use their traffic.
  • Use robots.txt as a defensive tool: A search engine doesn't need to see wp-content anyway, so block it:
    User-agent: *
    Disallow: /wp-content
  • It's the keywords, stupid: someone just dumped a load of keyword-laden pages on you, with targeted keyword links pointing back to them. Hello? Anyone care to turn this into a keyword research tool? Here is the pseudocode for the tool:
    Do a Google search for [inurl:wp-content/1]
    Scrape the URLs from the SERPs
    Scrape the spammy URLs
    For each spammy URL, do a [link:] search
    Scrape the backlinks and extract the anchor texts
    Save the keywords along with the spammy HTML
    Write a front-end to search the database
  • Report them! Figure out the IP address of the person who uploaded the spammy pages and report them. If you get trackback spam to the spammy pages, find the IP address of the trackback spammers and report them. Most SEO spammers will be using hosting services and their own computers. It is possible (although I'm guessing unlikely) they'll be using a proper botnet.
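
The "extract the anchor texts" step of the keyword tool above could look something like this in Python. It is only a sketch of that one step: fetching the search results and backlink pages is left out, and the class name, function name, and the default `wp-content/1/` marker are my own choices for illustration.

```python
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collect the anchor text of every link whose href contains a marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker   # substring identifying the spammy URLs
        self.keywords = []     # anchor texts found so far
        self._buffer = None    # text pieces of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.marker in href:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # Collapse runs of whitespace and normalize case
            text = " ".join("".join(self._buffer).split())
            if text:
                self.keywords.append(text.lower())
            self._buffer = None


def extract_keywords(html, marker="wp-content/1/"):
    """Return the anchor texts of links in `html` that point at spammy pages."""
    parser = AnchorTextExtractor(marker)
    parser.feed(html)
    parser.close()
    return parser.keywords
```

Feed it the HTML of each backlinking page and save whatever comes back alongside the spammy page it points to.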

So like pretty much everything in SEO, perhaps even this can be dealt with using some creativity... I'm sure there are better ways to deal with such spam, and the idea is to think about the opportunities here. Good luck!

MS Live Still Referral Spamming

That's right folks, after the initial fuss, the backtracking (with its very own official statement!), Microsoft's Live search engine is still doing these referral spamming requests.

I'm seeing this on my new service Social Alerter. The request details:

  • Remote: livebot-65-55-165-77.search.live.com (65.55.165.77)
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Referring URL: http://search.live.com/results.aspx?q=social&mrt=en-us&FORM=LIVSOP

Full list of IP addresses doing this:

  1. 65.55.165.90
  2. 65.55.165.43
  3. 65.55.165.96
  4. 65.55.165.120
  5. 65.55.165.100
  6. 65.55.165.76
  7. 65.55.165.16

The fake search queries are all either [social] or [alerts].

Anyone else seeing this? It's clearly not fixed as they claimed and is starting to get annoying.

Announcing Social Alerter

Doesn't it suck when you discover your site is down because a page went popular on Digg? Wouldn't it be nice if you somehow knew that your site is slowly inching its way up the upcoming list? And what about delicious? That could be a serious hit of traffic too.

Well now you can get a warning. Over the past few months, I've been slowly building a service called Social Alerter. Social Alerter is a free service that alerts you when your websites are about to go popular on Digg and delicious. You can monitor as many sites as you want and once it finds one, it sends you an email. You can use it to monitor your own sites, your competitors' sites (ha ;) ), and your favorite sites. You simply sign up and know that there is an eye out doing all the leg work.

This is the service in a nutshell. I've written a huge help section and if you read just one page, read the Social Alerter crash course.

Review of 2007, Predictions of 2008

This kind of post is something a few bloggers do. I enjoy reading them, so I thought I'd try my hand. It's a bit of the final score-card for the year and hopefully inspiration to do better (whatever that actually means) for next year. So what happened with me in 2007?

January

January is probably the best month of 2007. It kicked off the year with my first ever digg home pager, me doing a live podcast/talkshow, and the first rant of the year which set the pace for the months to come :D

It wasn't all fun and joy though: in January, eKstreme.com suffered a DoS attack.

February

February brought lots of developments: I started moderating at Cre8 a Site Forums, easily the friendliest place on the net. The second Digg home pager arrived too, and a major statistical analysis of the Socializer data got a lot of people interested.

March-July

Very quiet period. In March, I was busy thinking about my online strategy for eKstreme.com, blogSci.com, and the other major property I owned back then, fontfox.com. The outcome was a major change (for the better!) in the monetization effort of eKstreme.com, a decision to keep blogSci.com ad-free, and the realization that I wasn't doing much with fontfox. In the end, fontfox got sold in May.

In July, this blog got its first ever guest post. It was a great piece. However, this effort to bring fresh blood into this site was a dud: a lot of other people agreed to guest post but none actually sent me stuff :( Waaah.

Of course, lots of ranty anti-Google posts were written in this period. Back then, Google thought it was OK to abuse user data in many ways. To this day I still think they are abusing our data and it will probably get worse in 2008.

June-August

While the blogging was quiet, a lot was happening in the background. The CMS of eKstreme.com had been showing its age and slowing things down. The strategic review in February concluded that this had to be fixed. So the whole site was moved to use Wordpress as the CMS, which involved a lot of hacking to get WP to like my SEO tools and not break them. I also moved hosts.

July onwards

I started taking a very close look at the bots/crawlers hitting eKstreme.com and blogSci.com. This research resulted in a lot of bot-related posts and insights. I'm still collecting data to learn more about what bots look like. By bots, I mean the more malicious scraper spammy types, not the nice ones like Googlebot and Slurp!.

Out of this also came the realization that msnbot was misbehaving. First, the authentication was broken; second, it was not obeying the robots.txt file; and third, a very strange pattern of bot activity from live.com was detected. This resulted in the third Digg home pager. A few weeks later, MS backtracked. I don't know if it had anything to do with my post or not, though I doubt it.

All in all, a great year. Stay tuned for 2008 because there is a lot of great stuff coming. It'll be announced here as always.

Predictions for 2008

Now the really fun part :) What will happen in 2008? Here are some of my predictions:

  • Online office: Microsoft will release Silverlight 2.0 in early '08 (we already know that). Shortly afterwards, they'll release an online version of Office based on that. This will disrupt the market, making Google's Apps look like toys and Zoho very very vulnerable. Zoho will get acquired.

  • At least one major privacy scare on the web. Top contenders are Google and Facebook, but Microsoft cannot be discounted. My prediction is that it will be related to user profiling for ad-targeting purposes.
  • Rich Internet Applications (RIA) will arrive in full force. Everyone will look at each other and go 'eh' until a killer app is released. That app will probably be the online MS Office. Top contenders are Silverlight and Flex from Adobe. Flex has no chance against Silverlight because Adobe doesn't know how to write web-friendly software (like Acrobat Reader plugin for browsers, which sucks) and certainly is no match for the developer-friendly MS. Flex will live through 2008 though because end-consumers will think it's the Flash player.
  • In Search: Google will continue to dominate, but its growth will slow down. Semantic search engines like Powerset (I'm one of its public beta testers) will rock. Hakia will figure out that its biggest obstacle to world domination is its index: full of spam and very stale. Their technology is great though.
  • Yahoo will chug along. A few gem products will come out of their R&D efforts along with the continuous stream of half-baked ideas. The new delicious service, which finally loses the an.nno.ying dots from its name, will be a great hit.
  • Generally: more memes and more bloggers working in synchrony for a common cause.

So... will I eat my words in December 2008? Stick around and you'll find out :)

GTalk Translator Bot is Mediocre but Useful

By now you must have heard that Google Talk now includes translation bots you can invite into a conversation. When you invite any of these bots, they translate whatever you type from your language into the target language. A very brilliant idea with a perfect implementation mechanism, but does it work? Let's find out.

I've mentioned before that I am an Arabic speaker. Given that Arabic sports one of the most convoluted grammars on Earth, I thought what better way to test the bots than by having a solo chat with the en2ar bot. That is, I write in English and watch its Arabic responses. The results are below:

Google Talk translation bot conversation translating English to Arabic

Arabic speakers among you will spot many mistakes but the ideas are still mostly translated well. With basic phrases, the translation is flawless in most cases. With more convoluted writing, the translation breaks down. You can see two comments about a bad translation. The first one said "This translation sucks", which in colloquial English means it's bad. The translation used the meaning of "suck" literally, i.e., something you'd do with a straw and some juice. The next phrase, "This is a bad translation", was translated, well, badly, but the idea was still conveyed. The translation in Arabic actually says "This is the bad of translation". This grammatical structure is used in Arabic to emphasize the pinnacle of something (i.e. exemplary in its class), so in this case, the Arabic actually means "This is the worst of translation".

So all in all a useful feature but I don't see it being used for anything important like a business chat: the mistakes are simply too frequent for this to be used to convey complex ideas. It is machine translation after all and the state of the art is still bad.

Query String Collapsing

One of the problems that search engines and analytics packages have is dealing with URLs with query strings. For example, the following two URLs will return the same content from any given content management system but they are two different URLs in the eyes of search engines and analytics packages:

http://example.com/page.php?id=1&title=hello&from=homepage

http://example.com/page.php?title=hello&from=homepage&id=1

So how can we figure out that they are actually the same URL really? The solution I came up with is a simple multi-step processing algo. It goes like this:

  • Take the query string variables and save them in an array. So in the case of our first URL, the array would contain the following key=>value pairs:

    $vars = array('id'=>'1', 'title'=>'hello', 'from'=>'homepage');
  • Next, sort the array in alphabetical order based on the key names, like:

    $vars = array('from'=>'homepage', 'id'=>'1', 'title'=>'hello');
  • Now rebuild the URL based on the new order of the variables:

    http://example.com/page.php?from=homepage&id=1&title=hello
  • By now the trick should be clear: if you do that to all the URLs, you would always reach the same final re-composed URL as long as the variables are the same (i.e. the same names, and no URL has extra or missing variables).
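
The steps above can be sketched in a few lines. This version is in Python rather than PHP, using only the standard library; the function name `collapse_query_string` is made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def collapse_query_string(url):
    """Return the URL with its query variables sorted by key name."""
    parts = urlsplit(url)
    # keep_blank_values so "?a=&b=1" does not silently lose "a"
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # list.sort() is stable, so repeated keys keep their original relative order
    pairs.sort(key=lambda pair: pair[0])
    return urlunsplit(parts._replace(query=urlencode(pairs)))
```

Both example URLs from the top of this post collapse to http://example.com/page.php?from=homepage&id=1&title=hello. One thing to be aware of: `urlencode` re-encodes the values, so a URL that uses unusual escaping may come out spelled slightly differently even though it means the same thing.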

I call this Query String Collapsing. Why "collapsing" instead of normalization or decomposition? No real reason apart from thinking about this as collapsing a whole slew of URLs into a single representative entity. And I just like that name more that way :)

With this, what can we do with analytics? Save both the original URL as requested and the collapsed URL. This opens up a nice set of funky things you can do, but that's another post...

Irony

Support Wikipedia!

Hint: Look at the source code...

MS Admits to Referral Spamming as a Cloaking Check

Hot off the press: after the fuss raised by a bunch of us a few weeks ago, Donna now reports that Live ponies up about the referrer spam. They've issued a statement where they admit to:

  • A bug that caused issues with AdSense/Overture reporting.
  • Distorting site statistics with unfilterable bot traffic (except we know how to filter them!)
  • Polluting HTTP logs with inappropriate terms (true).

Microsoft also states that "Hopefully webmasters have also noticed these issues disappearing. If you are still experiencing any issues, please contact us before you block MSNBot, to see if we can address the issue."

Let me be the first to say a big thank you to Microsoft for making a very solid public response to the issue and answering our questions. This kind of transparency is exactly what fosters a good relationship between a search engine and webmasters.

And yes, Live.com team, I do default to your search engine for my searches. Works a treat (most of the time ;) ).

Yell if Microsoft’s Live.com Spammed You Too - Updated

Welcome Reddit, Digg, and StumbleUpon users! If you like this post, please vote below. Thank you!

Update 2: Yuri explains more background and asks What happens next?. Reuben Yau and Kichus have both blocked the IP addresses. Boy are people angry.

Update 1: DazzlinDonna from SEO Scoop has written an excellent background to this fiasco, and Michael VanDeMar is reporting that Microsoft is interfering with AdSense. Ouch.


The bot analysis continues, and this post presents evidence indicating that Microsoft is spamming websites. A big claim, I know, but I can't find a better explanation. You'll have to decide.

The summary: IP addresses belonging to Microsoft are requesting pages from eKstreme.com and blogSci.com (my science blog) with HTTP referer headers suggesting that the hits were from live.com searches. These referer headers are spoofed as the keywords from these supposed searches are sometimes in no way related to the requested page. Additionally, for most of the other supposed searches, the requested pages do not rank in the top 10 (first page of results) in a way to send this traffic.

For some odd reason, the webmaster community has known about this for a couple of months. In September, SE Roundtable posted about other webmasters complaining about this spam. Surprisingly, we also got official confirmation (via a WMW thread) from msndude that this is indeed happening and it's (and I'm quoting) "part of a quality check we run on selected pages". This is an unacceptable explanation, as you'll see from the data below, because it has none of the hallmarks of a quality check but all the marks of referral spam.

The hits discussed below are extracted from the blogSci.com data to keep things simple, but a similar data set exists for eKstreme.com.

The Hits

The whole list of hits is way too long to quote in full here, so here is a sampling of my favorite requests:

  • At: 17 August 2007 05:53:27 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/result.aspx?q=make+money+online&mrt=en-us&FORM=LVSP
  • Remote: bl2sch1082213.phx.gbl [] (65.55.165.119)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Enconding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

  • At: 18 August 2007 03:05:43 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/result.aspx?q=make+money+online&mrt=en-us&FORM=LVSP
  • Remote: bl2sch1082008.phx.gbl [] (65.55.165.66)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Enconding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

These two hits above are the first I have in my records. What's amusing about them is that both supposedly came from a search for [make money online].

  • At: 19 August 2007 03:55:48 AM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/result.aspx?q=ticket&mrt=en-us&FORM=LVSP
  • Remote: bl2sch1081815.phx.gbl [] (65.55.165.25)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Enconding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

This one is also very random: a blog post about a cool new magnet-based technology to create colors is ranking in the top 10 for the query [ticket]? Not even Live.com generates such irrelevant results.

Anything more recent? Sure:

  • At: 11 November 2007 03:26:43 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/results.aspx?q=osteoporosis&mrt=en-us&FORM=LIVSOP
  • Remote: bl2sch1081815.phx.gbl [] (65.55.165.25)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Enconding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

  • At: 11 November 2007 03:29:24 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/results.aspx?q=amazon&mrt=en-us&FORM=LIVSOP
  • Remote: bl2sch1081909.phx.gbl [] (65.55.165.43)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Enconding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

At the time of writing, there are 245 such hits in my records since August 2007.

Not convinced? There is more. Some of these hits came within seconds of the page being crawled by MSNBot. The pattern is like this: the page is requested by MSNBot (which is authenticated, so it's genuine) and within a few seconds, the very same page is requested as described above with a live.com search as referer. An example:

  • At: 10 November 2007 12:05:14 PM GMT
  • Routed to: /index.php
  • Referred from: (No referer.)
  • Remote: livebot-65-55-209-143.search.live.com [] (65.55.209.143)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: text/html, text/plain, text/xml, application/*, Model/vnd.dwf, drawing/x-dwf
    • Charset:
    • Enconding: identity;q=1.0
    • Languages:
  • UA: msnbot/1.0 (+http://search.msn.com/msnbot.htm)
  • Cookies:

  • At: 10 November 2007 12:05:36 PM GMT
  • Routed to: /index.php
  • Referred from: http://search.live.com/results.aspx?q=problem&mrt=en-us&FORM=LIVSOP
  • Remote: bl2sch1081810.phx.gbl [] (65.55.165.20)
  • Request: HTTP/1.0 GET
  • Accepting:
    • HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    • Charset:
    • Enconding:
    • Languages: en-us
  • UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
  • Cookies:

The typical delay between the indexing request and the spoofed search hit request is 5-20 seconds.
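
If you want to look for this pattern in your own logs, here is a rough Python sketch. It assumes each hit has already been parsed into a dict with the keys "at", "path", "ua" and "referer"; those key names, the function name, and the 30-second default window are mine, not part of any log format. The window sits a bit above the typical range so that the 22-second example above is still caught.

```python
from datetime import timedelta

MSNBOT_UA = "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
SPOOF_UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"


def find_followups(hits, max_delay=timedelta(seconds=30)):
    """Pair each MSNBot crawl with later live.com-referred hits to the same path.

    `hits` must be sorted by time. Returns a list of (path, delay) tuples.
    """
    pairs = []
    last_crawl = {}  # path -> time of the most recent MSNBot request
    for hit in hits:
        if hit["ua"] == MSNBOT_UA:
            last_crawl[hit["path"]] = hit["at"]
        elif hit["ua"] == SPOOF_UA and "search.live.com" in hit["referer"]:
            crawled = last_crawl.get(hit["path"])
            if crawled is not None and hit["at"] - crawled <= max_delay:
                pairs.append((hit["path"], hit["at"] - crawled))
    return pairs
```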

How to Recognize the Fake Hits

Anyone staring at these hits long enough will see some signatures to detect them:

  • Note how all of them have identical user agents (UA field) and pretty much everything else is identical (bar the requested page and the referer).
  • The IP addresses all belong to the same C-block, namely 65.55.165.*.
  • All of the query strings in the live.com referrers have &mrt=en-us in them. Here in the UK, I get &mkt=en-gb when I really use Live.com for a search.

Needless to say, this smells like bot behavior.
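
Those three signatures are easy to turn into a filter. A minimal Python sketch, assuming you already have the remote IP, user agent, and referer of each hit as strings (the function name is invented for the example, and it only knows about the one C-block and the one user agent seen in my logs):

```python
import ipaddress
from urllib.parse import parse_qs, urlsplit

SPOOF_UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"
SPOOF_NET = ipaddress.ip_network("65.55.165.0/24")  # the 65.55.165.* C-block


def looks_like_fake_live_hit(ip, user_agent, referer):
    """True when a hit matches all three signatures listed above.

    Raises ValueError if `ip` is not a valid IP address.
    """
    ref = urlsplit(referer)
    query = parse_qs(ref.query)
    return (
        ipaddress.ip_address(ip) in SPOOF_NET
        and user_agent == SPOOF_UA
        and ref.hostname == "search.live.com"
        and query.get("mrt") == ["en-us"]
    )
```

A real visitor searching from the UK fails the last test because the referer carries mkt=en-gb instead of mrt=en-us.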

An Analysis

Let's think about this for a minute: What on Earth is going on? Why are these hits happening? I can think of two explanations:

  • The tinfoil/sinister explanation: pure spam from MS. Why? So that webmasters see Live.com referrals coming in increasing numbers. This is not hard to hide: if you only get like 10 referrals from live.com a month, another 10 is a doubling but which sad webmaster would check those out (apart from me)?
  • The "surely not" explanation: this is an automated way to check the search results to see where pages rank for keywords the page could potentially rank for. This is what msndude confirmed in the WMW thread, but as you can see above, it doesn't really look like a quality check. Also, if this is indeed a quality check, why not run it on the cached pages and not alert (and annoy) the webmasters? Microsoft have full access to their index and they should use it!

I subscribe firmly to the first explanation: the search keywords are spammy in some cases and always too general; the requested pages never rank in the top 10 as the referring URLs would suggest; the hits have identical user agents (i.e. none of the variation you would expect from different people using normal browsers on different operating systems, even within the same company); and the actual referring URL does not match what a human being searching on live.com generates.

In short: it's spam and not a quality control check. What do you think?

Open Handset Alliance

Dear Google,
You suck.
Love,
Apple

Seriously folks, why isn't Google's adopted child, Apple's iPhone, part of the Open Handset Alliance? And Microsoft?

Anyone want to take bets that MS and Apple create their own alliance? That would be fun.

PHP Auto Prepend and its Uses

A thread at SEO Refugee started with asking for help about a funny URL, which we deciphered to be a probe to try to attack the website. I've seen this before and so I suggested that they block the whole IP C-block. Which turned into the question of: how do you block IP addresses?

The way I do it is using PHP and it uses a nice little trick that few seem to know about. This is how:

PHP has a feature that allows you to pre-pend a file at every PHP request. This prepend file is the equivalent of having it include()ed at the top of every single PHP script on your site. This is done through a directive that is set either in php.ini or .htaccess. The directive is called auto_prepend_file. For .htaccess, this is what I use:

php_value auto_prepend_file "/full/path/to/a/prepend-file.php"

Because it runs at every PHP request and it runs before the actual requested script, you can do some really neat things. So what do I do? I'm developing this system internally and at the moment it does three things:

  1. Authenticate SE bots
  2. Analytics (the data logger)
  3. Block IP addresses

The blocking works as follows: there is a special directory where I put empty files that dictate the blocking. The file names take one of two formats, a.b.c.d or a.b.c, depending on whether I want to block a specific IP address (the former format) or a whole C-block (the latter). In the pre-pend file, there is a simple check: figure out the remote IP address, and check for the presence of either its file or its C-block file. So if the remote IP is 111.222.333.444, it checks for the presence of either /111.222.333 or /111.222.333.444. If either exists, a 403 Forbidden header is returned and the code exit()s, so no actual content gets displayed.
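
To make the lookup concrete, here is the same check sketched in Python rather than PHP. It is not my actual prepend code: the function names and the `block_dir` argument are invented for the example, and it only handles dotted IPv4 addresses.

```python
import os


def is_blocked(remote_ip, block_dir):
    """True if a marker file exists for the IP or for its first three octets."""
    octets = remote_ip.split(".")
    # Only accept four numeric parts, so odd input can never turn into a file path
    if len(octets) != 4 or not all(part.isdigit() for part in octets):
        return False
    c_block = ".".join(octets[:3])
    return (
        os.path.exists(os.path.join(block_dir, remote_ip))
        or os.path.exists(os.path.join(block_dir, c_block))
    )


def block(entry, block_dir):
    """Create an empty marker file named a.b.c.d or a.b.c, like touch()."""
    open(os.path.join(block_dir, entry), "a").close()
```

The nice part of using empty files is that there is nothing to parse: a block is just two existence checks per request, and unblocking is deleting a file.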

This raises the question: how do you add files to the directory? Using a web interface of course :) You can do it with a simple touch() or an fopen().

For completeness, there is a sister directive called auto_append_file which runs after each PHP script is called (with the exception that if the script exit()s, the append file doesn't run). I've never used it, but it can be useful for things like measuring how quickly scripts run on your server.

What a Great Day - An Analysis

Today is a great day to be... not Google. As someone who's spoken out many times against Google and its practices, I'm very happy today. This is not a simple ranting post, so please bear with me as I explain away the smile. Two reasons to be smiling:

  • Google slapped their most vocal supporters in the face. Actually, they kicked them in the groin and when they went down, Google took a big stick and hit them on the head. Yes, it's the PR "downdate" (get it?) of today. I'll explain why this is a stupid move (if it's not a glitch).
  • Google lost out to Microsoft today. Ironically, it just proves that as prying as Google wants to be by targeting ads to users, there are others willing to be even more 'evil' (for the lack of a more descriptive word). It really does seem that old-skool pre-Web companies may still be able to teach G a thing or two.

First the PR update. There is no evidence to suggest that this is due to link selling. There is no evidence to suggest it's a vandetta against people who've spoken against Big G. It could be a glitch, but my favorite theory: ALL toolbar PR will drop to zero. Why? The publicly visible PR has always been a thorn in Google's back: everyone watches it, people use it to assert authority, people try to manipulate it, and worst of all, it created a market because it's in a finite supply.

I also think the visible PR kick-started a mentality of not linking. When it started, it was "I don't want PR to leak" and so people said "leak it for $$$". People saw that this was working well as a money source and worth it as an investment for traffic, PR, and rankings. So Google responded, half-heartedly, by delaying the publication of the true PR by a few months. That was annoying but it worked. It also meant that on some (most?) valuable sites, getting a link was virtually impossible without forking over some cash.

Next came the mangling of nofollow. It first started as this innocent anti-spam measure. Yeah, right. As if that was ever going to deter spammers (just ask the email spam filter companies). If anything, that made spammers get more creative and we're still happily getting spammed. So now we had this kinda useless tool until some genius figured it out: it can manage the flow of PR in a site. Heck, I can now even link out to other sites I wouldn't have because I can tell Google to ignore the link. Think about that: I am actively linking to a website while at the same time sending a message to Google that I don't trust said site. Hypocrisy at its finest, all thanks to Google.

As you can imagine, this is not a stable situation. Eventually, a drastic measure would be needed to fix it. The most obvious one? Kill the visible PR. Take out the symptoms and the disease. What happens if you make all PR in the world zero? It will become useless. You take away the commodity that's being traded. And hopefully, you'll save the net from the mess you created by freeing people from worries about linking to each other again. The corollary to that is nuking nofollow, which I honestly believe is also required. However, let me be the first to note there is no evidence of this happening. None.

So let us for a moment assume that today's PR update (which hit eKstreme.com, by the way) is not a glitch. Let's assume it's a planned move, which would include a PR algo update. What does it mean?

  • One possibility is that all PR is going down to zero, as I think it eventually will. Only time will tell if this is true or not.
  • Another possibility is that Google is penalizing people, either with a biased algo update or with manual intervention. If so, why on Earth hit the people who speak about the company the most? In all markets, especially techie ones, there is an adoption curve: there are the pioneers who are the most addicted to your product, who will speak about you in holy terms, and who will infect everyone around them to use your product. Successful startups sell to these people first, and established companies make sure these guys and gals are happy. Keep this free loud-speaker marketing channel happy, and you'll be happy. Google just made this group of people unhappy. Google has stepped over the line. People are furious. Google will pay. How? We will start seeing Google for what it really is: an ad agency out to make a buck - a boring company! - not some "not evil" librarian out to index the world. This is the first step to people switching away from Google. This is the first step to Google losing its grip. And you know what? The competition would be willing to take on these refugees.

Anyway, I've rambled on this too much. On to the Google-Microsoft-Facebook love triangle.

Microsoft was always the most likely winner since they already had a partnership with Facebook. As TechCrunch put it, it is the path of least resistance. Also, we should note that this was never about the money: it's a political win and a winning of mindshare. Also, whoever won would officially become the most spying, prying, ad agency on the web - in the world! Google's stated goal is to target advertising based on search history and other personal data. They call it 'personalization'. Everyone kicked up a fuss about how naughty this is, so much so, that when Google wanted to buy DoubleClick, the cries became louder: Google has too much creepy oversight over us.

Still, whatever info Google had, it's nothing compared to what Facebook has - Google's knowledge of me is harmless next to what I have in my FB profile. Imagine an ad agency: would you like to target 25-30 year old males in a relationship with a woman and having liberal political and religious views? Facebook will say "Sure! We got some of those!" Google has nothing to say about that.

The level of targeting can be ridiculous: wanna target people who may have missed someone's birthday? How about those who were recently in a relationship but are not anymore? What about those who just entered a relationship? How about accurate geotargeting? All this info and much more is readily available through Facebook, and now Microsoft has access.

So all in all, it's an exciting and eventful day, but looking at the potential privacy worries, it doesn't bode well for the future. Whatever happens, today is very likely to go down in history as a turning point for both Google and Microsoft.

While I'm rambling, I'll finish with some predictions:

  • Facebook is already the home page for many people. They will add web search from Microsoft. Ooops.
  • Google will fight back with some serious innovation in AdWords. This will be good for the advertisers but not so good for users.
  • Google will win the EU antitrust hearts and be cleared to buy DoubleClick. All they need to do is point at the MS-FB deal.
  • MS now has Digg and Facebook. Who's next? Federated Media (the ad agency) comes to mind.

All thoughts welcome below :)

Introducing Open Keyword

Yesterday, I posted my 2000th post on Cre8asite Forums where I moderate. The post was about writing tools (again!) but this time with a twist: the tool is released under a BSD license, meaning it can be downloaded and used as you see fit. The project, called Open Keyword, is a keyword generation tool for SEO purposes.

Open Keyword has two components:

  • A keyword searching tool that gets related keywords from Google Suggest, Yahoo! Live Search, and Yahoo Related Search.
  • A keyword scraping tool that gets the list of keywords from the Google Trends Hot List, which is updated hourly. These keywords are stored in a database and retrieved by the search script.

There are two key things about Open Keyword that make it unique:

  • You run it yourself, meaning you can customize the code, for example to alert you when a keyword has become popular. Also, no one will be able to log your keyword research activity (as would happen if you use tools on other sites) and because you have the data, you can do funky custom analysis yourself.
  • Because of the data sources, you know the keywords are the most popular according to the search engines. As such, they can be used as seeds for further keyword research to build up the keyword list.

And since Open Keyword is open source, anyone can write improvements. If you do, please share them with me and I'll incorporate them so that everyone benefits.

So all you have to do now is go to the Open Keyword home page and download it. Full instructions on setting it up are also on the page. And if you get stuck gimme a shout. Comments and thoughts below please :)

Arabic SEO

I've been thinking about using Arabic in URLs, a question asked by Rand Fishkin of SEOmoz over at Cre8. Rand's question was:

What if you are optimizing in the Arabic character language set and want to include "keywords" in your URL

As an Arabic speaker and user of Arabic websites, I feel I can help answer this one. The answer is applicable to other languages as it deals with technical issues faced by all non-English languages. Arabic is merely the language we draw specific examples from. So here goes...

Talking in (en)code

URLs are allowed only a certain set of characters for them to work: the English alphabet (both lowercase and uppercase), the numbers, dashes, dots, forward slashes, the question mark, and a few others. These chosen few characters are based on American English as defined by the ASCII standard, for historical reasons. All other characters, like some English punctuation and all non-English characters, have to be encoded.
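
To see what that encoding looks like in practice, here is a small Python sketch using the standard library. It assumes the text is encoded as UTF-8, and the Arabic word for "directory" is written with escape codes so the snippet itself stays plain ASCII.

```python
from urllib.parse import quote, unquote

# The Arabic word for "directory", spelled with Unicode escapes.
word = "\u062f\u0644\u064a\u0644"

encoded = quote(word)       # each UTF-8 byte becomes %XX
decoded = unquote(encoded)  # and back to the Arabic text

print(encoded)
print(decoded == word)
```

Four Arabic letters turn into 24 characters of percent signs and hex digits, which is the readability problem the rest of this post is about.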

The question needs to be answered for domain names too. Wikipedia has a nice summary of international domain names that allow non-ASCII characters in them. However, support for that is not universal yet, and as we'll see later, different browsers will handle international domain names differently. For now, I would recommend steering clear of these for SEO purposes.

Usability trumps the day?

OK, so we know that non-ASCII characters have to be encoded, and so what about Rand's question about keywords in the URL? This raises a very interesting question: If you know the URLs are going to be encoded, doesn't usability dictate that you use non-encoded text? So given the choice between these two URLs:

  • site.com/?page=%D8%AF%D9%84%D9%8A%D9%84
  • site.com/?page=directory (Rand's question was about a directory as in DMOZ not as in folder)

which one would you choose? Is the presence of the word in the URL really that key for ranking?

I would argue that in this case, for usability's sake, I would go for something like:

  • site.com/directory
  • site.com/node/1

Then the actual page contents will be in Arabic or any other language. The anchor text is also key, so in-site optimization becomes super-critical, not to mention on-page techniques.

International domain names

The other thing to consider is how users input URLs. Do they type them? How important is type-in traffic for the site in question? Most likely, people will type the domain name in English. Speaking of which, try this search, which points to an Arabic domain name (see the PS below as to why I'm linking to Google to give the domain name), and watch the URL in different browsers. Safari keeps the URL as it is, but Firefox and IE 7 change it to http://xn--ugb6bax.com/. That last URL is certainly not memory-friendly.
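
The xn-- form is not random: it is the ASCII spelling of the Arabic name, and the conversion works both ways. As a rough illustration, Python's built-in "idna" codec (which implements the older IDNA 2003 rules, so treat this as a sketch rather than what any particular browser does) can convert between the two:

```python
# The ASCII form that Firefox and IE 7 show in the address bar.
ascii_domain = b"xn--ugb6bax.com"

# Decode to the Unicode form, then encode it back again.
unicode_domain = ascii_domain.decode("idna")
round_trip = unicode_domain.encode("idna")

print(unicode_domain)
print(round_trip)
```

The decoded label is the Arabic word for "directory", followed by the plain .com. Users only ever see one of the two forms, depending on their browser.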

So to sum up: I would be careful how I use non-English letters in URLs.

What about CLIR?

Back in May, Google came out with the all-singing-all-dancing-Ask.com-copycat called Universal Search. Buried in the announcements is a little gem called Cross-Language Information Retrieval (also see this). From SEL's post:

Search queries will be entered in the native language, translated into English and run against Google's index. Any retrieved pages/sites will then be translated from English back into the native language.

I'm sure this will affect the kind of SEO we're talking about here, but I haven't done any tests to see how and how much. Anyone got data?

Anglicized Arabic

Another thing you will notice is anglicized Arabic (Arabic written in English letters): sometimes you'll see numbers in the middle of the words. This is because there is a colloquial transliteration system, developed over the past decade or so (thanks to mobile phones and the internet!), to write Arabic-only letters in English. Example: Arabic has two H-like characters. One is pronounced like the H in Henry, and the other is deeper, sounding almost like the H you would make with a scratchy throat. The Henry-like H is transliterated into H in English, and the second type of H is transliterated into a 7.

How the search engines handle (parse and index) such transliterated text is a very big question. A quick search for [7mar] (donkey in Arabic :), which is used as an insult along the lines of stupid or moron) shows that quite a few pages are indexed with that "word" in them. Interestingly, Google thinks these pages are English, and if you do an Arabic search specifically, you get another set of SERPs, suggesting that unless explicitly told the text is Arabic, Google at least will get confused.

There is more to this story, as part of another bigger story, but that's for another post. In the meantime, please post questions and comments below :)

PS - Why isn't there a word of Arabic in this post? It's because WordPress thinks it's the best piece of software in the world and keeps editing my Arabic into question marks. Using either IE or FF does not fix this problem.
