A Little Bump
Is Matt McGee the last person on Twitter? Seems so.
So come on, Googs, help him out.
Oh, MSNbot, when will you ever learn? I won't rehash the story that led me to block MSN's referral-spamming bot; the block seems to have worked a bit. The problem is that the referral spam is still coming in! Yes, MSNbot is blocked, but the spammy hits keep arriving.
Case in point, this hit from today over at Social Alerter:
Is it just me or is this beyond comical now?
The screenshot below is from my AdSense account. It seems I have reached the pinnacle of optimization as no new optimization suggestions have been recommended since February.

Is this a bug or account specific? Each of the reports I see is different.
I just had an "OMG this will change the world!" kind of moment while playing for just 5 minutes with Google's App Engine. Let me explain.
A bit of background first: The Google App Engine is a newly launched service from Google that, for a change, seems to be well thought out. The service offers a Python-only environment (for now) to build applications locally and host them on Google's vast infrastructure. The idea here is that you don't have to worry about scaling your application to handle massive traffic; you let the App Engine running on Google's servers deal with it. The Engine comes with lots of goodies like database handling, user logins (and what a boon that will be for Google accounts), and others. All in all, a nice comfy environment for rapid application development and reliable hosting.
But from all the buzz on the net, I think there is something missing that I just hinted at above:
to build applications locally and host them on Google's vast infrastructure
App Engine comes with its own development setup that runs off your computer (available for Windows, OSX, and Linux). You develop the application on your computer, run it, test it, add features, and then upload it to Google's computers. My question is this: What's stopping Google from turning the local development code into a full desktop-based runtime for web applications? Why keep it as a development-only environment?
Let's look at this from another angle: the desktop-webapp integration market. Adobe recently released their oddly-named AIR (Adobe Integrated Runtime). In the AIR-world, you can write applications in HTML/CSS/JS or Actionscript and package them into desktop applications that run within AIR or within the Flash player in the browser. The AIR environment is available for Windows and Macs, and Linux support is on the way. Brilliant move: one code base, both browser and desktop functionality.
Microsoft also has a similar play in the form of .Net, and more specifically Silverlight. The .Net runtime is available for many devices and platforms (mobile, desktop, and I think even the XBox). With Silverlight, Microsoft's play is to give developers a platform to use .Net in the browser; this is coming in Silverlight 2.0 this summer. So with this, again, one code base can be used on the web and on the desktop to give true multi-platform programming.
There are other entries in this market, Mozilla Prism being a prominent example. They all promise the same thing: one code, many places to run it with varying details.
Now back to App Engine and to the question I posed: imagine Google comes out with a desktop runtime/environment that turns App Engine webapps into desktop-based apps. This will be directly parallel to Adobe's AIR but with a big difference: the same code will also be easily deployable on a reliable and scalable infrastructure - Adobe doesn't have that.
There is another difference: because of the way App Engine works, you could easily imagine it talking to Google Apps like Google Docs etc. A desktop App Engine will bring Google's applications onto the desktop and open up a market-disrupting war: direct office productivity competition with Microsoft. To rephrase, App Engine could be Google's way to enter Microsoft's turf on the desktop.
So any evidence for this? Nothing solid, so it's all speculation, but I'll point to three hints:
What do you think? I think this is the best move out of Google yet and as disruptive as AdWords was.
I've had it. The live.com spambot, aka msnbot, is officially not welcome either here or at Social Alerter. Why? The bot is still referral spamming. How much? 100% of my live.com referrals at Social Alerter are actually the bot's spam. Granted the absolute number of hits is only in the low tens, but it is not right and such behavior is no longer welcome. And no, the constant lies that this behavior has stopped do not help.
For a background on this, start here, then read this post, and close off with the follow up.
Bye, bye. I hope to see you never.
That's right folks. Today I decided to actually do something about my Twitter account. Follow me at pierrefar.
The question is *what* will I do with the account? It may be a few days before I dive in properly
See you @twitter
Donna over at SEO Scoop asks an excellent question: more and more we're seeing website attacks for SEO purposes, not more malicious intents (like stealing credit card details). Donna asks, how should we deal with this kind of attack? I'm going to hazard some suggestions.
First things first. We're not dealing with hackers. Nosiree, we're dealing with crackers. A hacker is a well-seasoned coder. A cracker is a hacker who exploits security holes for nefarious purposes.
With semantics out of the way, here are some suggestions:
So, like pretty much everything in SEO, perhaps even this can be dealt with using some creativity... I'm sure there are better ways to deal with such spam, and the idea is to think about the opportunities here. Good luck!
That's right folks, after the initial fuss, the backtracking (with its very own official statement!), Microsoft's Live search engine is still doing these referral spamming requests.
I'm seeing this on my new service Social Alerter. The request details:
Full list of IP addresses doing this:
The fake search queries are all either [social] or [alerts].
Anyone else seeing this? It's clearly not fixed as they claimed and is starting to get annoying.
Doesn't it suck when you discover your site is down because a page went popular on Digg? Wouldn't it be nice if you somehow knew that your site is slowly inching its way up the upcoming list? And what about delicious? That could be a serious hit of traffic too.
Well now you can get a warning. Over the past few months, I've been slowly building a service called Social Alerter. Social Alerter is a free service that alerts you when your websites are about to go popular on Digg and delicious. You can monitor as many sites as you want and once it finds one, it sends you an email. You can use it to monitor your own sites, your competitors' sites (ha!), and your favorite sites. You simply sign up and know that there is an eye out doing all the leg work.
This is the service in a nutshell. I've written a huge help section and if you read just one page, read the Social Alerter crash course.
This kind of post is something a few bloggers do. I enjoy reading them, so I thought I'd try my hand. It's a bit of the final score-card for the year and hopefully inspiration to do better (whatever that actually means) for next year. So what happened with me in 2007?
January was probably the best month of 2007. It kicked off the year with my first ever Digg home pager, me doing a live podcast/talkshow, and the first rant of the year, which set the pace for the months to come.
It wasn't all fun and joy though: in January, eKstreme.com suffered a DoS attack.
February brought lots of developments: I started moderating at Cre8 a Site Forums, easily the friendliest place on the net. The second Digg home pager arrived too, and a major statistical analysis of the Socializer data got a lot of people interested.
Very quiet period. In March, I was busy thinking about my online strategy for eKstreme.com, blogSci.com, and the other major property I owned back then, fontfox.com. The outcome was a major change (for the better!) in the monetization of eKstreme.com, a decision to keep blogSci.com ad-free, and the realization that I wasn't doing much with fontfox. In the end, fontfox got sold in May.
In July, this blog got its first ever guest post. It was a great piece. However, this effort to bring fresh blood into this site was a dud: a lot of other people agreed to guest post but none actually sent me stuff.
Waaah.
Of course, lots of ranty anti-Google posts were written in this period. Back then, Google thought it was OK to abuse user data in many ways. To this day I still think they are abusing our data and it will probably get worse in 2008.
While the blogging was quiet, a lot was happening in the background. The CMS of eKstreme.com has been showing its age and slowing things down. The strategic review in February concluded that this has to be fixed. So the whole site was moved to use Wordpress as the CMS, which involved a lot of hacking to get WP to like my SEO tools and not break them. I also moved hosts.
I started taking a very close look at the bots/crawlers hitting eKstreme.com and blogSci.com. This research resulted in a lot of bot-related posts and insights. I'm still collecting data to learn more about what bots look like. By bots, I mean the more malicious scraper spammy types, not the nice ones like Googlebot and Slurp!.
Out of this also came the realization that msnbot was misbehaving. First, the authentication was broken; second, it was not obeying the robots.txt file; and third, a very strange pattern of bot activity from live.com was detected. This resulted in the third Digg home pager. A few weeks later, MS backtracked. I don't know if it had anything to do with my post or not - I doubt it.
All in all, a great year. Stay tuned for 2008 because there is a lot of great stuff coming. It'll be announced here as always.
Now the really fun part
What will happen in 2008? Here are some of my predictions:
So... will I eat my words in December 2008? Stick around and you'll find out
By now you must have heard that Google Talk now includes translation bots you can invite into a conversation. When you invite any of these bots, they translate whatever you type from your language into the target language. A very brilliant idea with a perfect implementation mechanism, but does it work? Let's find out.
I've mentioned before that I am an Arabic speaker. Given that Arabic sports one of the most convoluted grammars on Earth, I thought what better way to test the bots than by having a solo chat with the en2ar bot. That is, I write in English and watch its Arabic responses. The results are below:

Arabic speakers among you will spot many mistakes but the ideas are still mostly translated well. With basic phrases, the translation is flawless in most cases. With more convoluted writing, the translation breaks down. You can see two comments relating to a bad translation. The first one said "This translation sucks", which in colloquial English means it's bad. The translation used the meaning of "suck" literally, i.e., something you'd do with a straw and some juice. The next phrase, "This is a bad translation", was translated, well, badly, but the idea was still conveyed. The translation in Arabic actually says "This is the bad of translation". This grammatical structure is used in Arabic to emphasize the pinnacle of something (i.e. exemplary in its class), so in this case, the Arabic actually means "This is the worst of translation".
So all in all a useful feature but I don't see it being used for anything important like a business chat: the mistakes are simply too frequent for this to be used to convey complex ideas. It is machine translation after all and the state of the art is still bad.
One of the problems that search engines and analytics packages have is dealing with URLs with query strings. For example, the following two URLs will return the same content from any given content management system, but they are two different URLs in the eyes of search engines and analytics packages:
http://example.com/page.php?id=1&title=hello&from=homepage
http://example.com/page.php?title=hello&from=homepage&id=1
So how can we figure out that they are actually the same URL really? The solution I came up with is a simple multi-step processing algo. It goes like this:
Take the query string variables and save them in an array. So in the case of our first URL, the array would contain the following key=>value pairs:
id => 1
title => hello
from => homepage
Next, sort the array alphabetically by key name, like:
from => homepage
id => 1
title => hello
Now rebuild the URL based on the new order of the variables:
http://example.com/page.php?from=homepage&id=1&title=hello
By now the trick should be clear: if you do that to all the URLs, you will always reach the same final re-composed URL as long as the variables are the same (i.e. the same names, and neither URL has extra or missing variables).
I call this Query String Collapsing. Why "collapsing" instead of normalization or decomposition? No real reason apart from thinking about this as collapsing a whole slew of URLs into a single representative entity. And I just like that name more.
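The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code I actually run: the function name collapse_query_string is made up for this example, and it assumes that sorting by variable name alone is enough (repeated names keep their original relative order).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def collapse_query_string(url):
    """Return the URL with its query string variables sorted by name."""
    parts = urlsplit(url)
    # Step 1: save the variables as (name, value) pairs.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Step 2: sort alphabetically by variable name (stable sort).
    pairs.sort(key=lambda pair: pair[0])
    # Step 3: rebuild the URL with the variables in the new order.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )

first = collapse_query_string(
    "http://example.com/page.php?id=1&title=hello&from=homepage"
)
second = collapse_query_string(
    "http://example.com/page.php?title=hello&from=homepage&id=1"
)
print(first)
print(first == second)
```

Both example URLs collapse to http://example.com/page.php?from=homepage&id=1&title=hello, so the comparison prints True.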
With this, what can we do with analytics? Save both the original URL as requested and the collapsed URL. This opens up a nice set of funky things you can do, but that's another post...
Hot off the press: after the fuss raised by a bunch of us a few weeks ago, Donna now reports that Live ponies up about the referrer spam. They've issued a statement where they:
Microsoft also states that "Hopefully webmasters have also noticed these issues disappearing. If you are still experiencing any issues, please contact us before you block MSNBot, to see if we can address the issue."
Let me be the first to say a big thank you to Microsoft for making a very solid public response to the issue and answering our questions. This kind of transparency is exactly what fosters a good relationship between a search engine and webmasters.
And yes, Live.com team, I do default to your search engine for my searches. Works a treat (most of the time).
Welcome Reddit, Digg, and StumbleUpon users! If you like this post, please vote below. Thank you!
Update 2: Yuri explains more background and asks What happens next?. Reuben Yau and Kichus have both blocked the IP addresses. Boy are people angry.
Update 1: DazzlinDonna from SEO Scoop has written an excellent background to this fiasco, and Michael VanDeMar is reporting that Microsoft is interfering with AdSense. Ouch.
The bot analysis continues, and this post presents evidence indicating that Microsoft is spamming websites. A big claim, I know, but I can't find a better explanation. You'll have to decide.
The summary: IP addresses belonging to Microsoft are requesting pages from eKstreme.com and blogSci.com (my science blog) with HTTP referer headers suggesting that the hits were from live.com searches. These referer headers are spoofed as the keywords from these supposed searches are sometimes in no way related to the requested page. Additionally, for most of the other supposed searches, the requested pages do not rank in the top 10 (first page of results) in a way to send this traffic.
For some odd reason, the webmaster community has known about this for a couple of months. In September, SE Roundtable posted about other webmasters complaining about this spam. Surprisingly, we also got official confirmation (via a WMW thread) from msndude that this is indeed happening and it's (and I'm quoting) "part of a quality check we run on selected pages". This is an unacceptable explanation, as you'll see from the data below, because it has none of the hallmarks of a quality check but all the marks of referral spam.
The hits discussed below are extracted from the blogSci.com data to keep things simple, but a similar data set exists for eKstreme.com.
The whole list of hits is way too long to quote in full here, so here is a sampling of my favorite requests:
- At: 17 August 2007 05:53:27 PM GMT
- Routed to: /index.php
- Referred from: http://search.live.com/result.aspx?q=make+money+online&mrt=en-us&FORM=LVSP
- Remote: bl2sch1082213.phx.gbl [] (65.55.165.119)
- Request: HTTP/1.0 GET
- Accepting:
- HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
- Charset:
- Enconding:
- Languages: en-us
- UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
- Cookies:
- At: 18 August 2007 03:05:43 PM GMT
- Routed to: /index.php
- Referred from: http://search.live.com/result.aspx?q=make+money+online&mrt=en-us&FORM=LVSP
- Remote: bl2sch1082008.phx.gbl [] (65.55.165.66)
- Request: HTTP/1.0 GET
- Accepting:
- HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
- Charset:
- Enconding:
- Languages: en-us
- UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
- Cookies:
These two hits above are the first I have in my records. What's amusing about them is that both supposedly came from a search for [make money online].
- At: 19 August 2007 03:55:48 AM GMT
- Routed to: /index.php
- Referred from: http://search.live.com/result.aspx?q=ticket&mrt=en-us&FORM=LVSP
- Remote: bl2sch1081815.phx.gbl [] (65.55.165.25)
- Request: HTTP/1.0 GET
- Accepting:
- HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
- Charset:
- Enconding:
- Languages: en-us
- UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
- Cookies:
This one is also very random: a blog post about a cool new magnet-based technology to create colors is ranking in the top 10 for the query [ticket]? Not even Live.com generates such irrelevant results.
Anything more recent? Sure:
- At: 11 November 2007 03:26:43 PM GMT
- Routed to: /index.php
- Referred from: http://search.live.com/results.aspx?q=osteoporosis&mrt=en-us&FORM=LIVSOP
- Remote: bl2sch1081815.phx.gbl [] (65.55.165.25)
- Request: HTTP/1.0 GET
- Accepting:
- HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
- Charset:
- Enconding:
- Languages: en-us
- UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
- Cookies:
- At: 11 November 2007 03:29:24 PM GMT
- Routed to: /index.php
- Referred from: http://search.live.com/results.aspx?q=amazon&mrt=en-us&FORM=LIVSOP
- Remote: bl2sch1081909.phx.gbl [] (65.55.165.43)
- Request: HTTP/1.0 GET
- Accepting:
- HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
- Charset:
- Enconding:
- Languages: en-us
- UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
- Cookies:
At the time of writing, there are 245 such hits in my records since August 2007.
Not convinced? There is more. Some of these hits came within seconds of the page being indexed by MSNBot. The pattern is like this: the page is requested by MSNBot (which is authenticated, so it's genuine) and within a few seconds, the very same page is requested as described above with a live.com search as the referer. An example:
- At: 10 November 2007 12:05:14 PM GMT
- Routed to: /index.php
- Referred from: (No referer.)
- Remote: livebot-65-55-209-143.search.live.com [] (65.55.209.143)
- Request: HTTP/1.0 GET
- Accepting:
- HTTP: text/html, text/plain, text/xml, application/*, Model/vnd.dwf, drawing/x-dwf
- Charset:
- Enconding: identity;q=1.0
- Languages:
- UA: msnbot/1.0 (+http://search.msn.com/msnbot.htm)
- Cookies:
- At: 10 November 2007 12:05:36 PM GMT
- Routed to: /index.php
- Referred from: http://search.live.com/results.aspx?q=problem&mrt=en-us&FORM=LIVSOP
- Remote: bl2sch1081810.phx.gbl [] (65.55.165.20)
- Request: HTTP/1.0 GET
- Accepting:
- HTTP: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
- Charset:
- Enconding:
- Languages: en-us
- UA: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
- Cookies:
The typical delay between the indexing request and the spoofed search hit request is 5-20 seconds.
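To measure that delay from log entries like the ones above, something along these lines would do. It is a rough sketch: the format string is my reading of the "At:" timestamps as printed in this post, and the helper name seconds_between is made up for the example.

```python
from datetime import datetime

# Matches timestamps such as "10 November 2007 12:05:14 PM GMT".
LOG_TIME_FORMAT = "%d %B %Y %I:%M:%S %p GMT"

def seconds_between(first, second):
    """Whole seconds elapsed from the first log timestamp to the second."""
    start = datetime.strptime(first, LOG_TIME_FORMAT)
    end = datetime.strptime(second, LOG_TIME_FORMAT)
    return int((end - start).total_seconds())

# The MSNBot request and the spoofed hit from the example above.
delay = seconds_between(
    "10 November 2007 12:05:14 PM GMT",
    "10 November 2007 12:05:36 PM GMT",
)
print(delay)
```

For the pair quoted above this comes out at 22 seconds, just past the upper end of the typical range.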
Anyone staring at these hits long enough will see some signatures to detect them:
Needless to say, this smells like bot behavior.
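If you want to flag these hits in your own logs, here is a hedged Python sketch. The checks are only what the sample hits quoted in this post have in common (a search.live.com referer, a remote address in 65.55.165.x, and one fixed user agent); treating the whole 65.55.165.0/24 block as suspect is an assumption made for the example, and the function name is made up.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: every sample hit in this post came from 65.55.165.x.
SUSPECT_NETWORK = ipaddress.ip_network("65.55.165.0/24")
# The user agent shared by all the sample hits.
SUSPECT_UA = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"

def looks_like_spoofed_live_hit(remote_ip, referer, user_agent):
    """True when a hit matches all three traits of the sample hits."""
    from_live_search = urlsplit(referer).netloc == "search.live.com"
    in_suspect_block = ipaddress.ip_address(remote_ip) in SUSPECT_NETWORK
    return from_live_search and in_suspect_block and user_agent == SUSPECT_UA

spoofed = looks_like_spoofed_live_hit(
    "65.55.165.20",
    "http://search.live.com/results.aspx?q=problem&mrt=en-us&FORM=LIVSOP",
    SUSPECT_UA,
)
genuine_bot = looks_like_spoofed_live_hit(
    "65.55.209.143",
    "",
    "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
)
print(spoofed, genuine_bot)
```

The genuine MSNBot request from the example above is deliberately not flagged: it sends no referer and identifies itself honestly.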
Let's think about this for a minute: What on Earth is going on? Why are these hits happening? I can think of two explanations:
I subscribe firmly to the first explanation: the search keywords are spammy in some cases and always too general; the requested pages never rank in the top 10 as the referring URLs would suggest; the hits have identical user agents (i.e. none of the variation you would expect from various people using normal browsers on different operating systems within the same company); and the actual referring URL does not match what a human being searching on live.com generates.
In short: it's spam and not a quality control check. What do you think?
Seriously folks, why isn't Google's adopted child, Apple's iPhone, part of the Open Handset Alliance? And Microsoft?
Anyone want to take bets that MS and Apple create their own alliance? That would be fun.
A thread at SEO Refugee started with asking for help about a funny URL, which we deciphered to be a probe to try to attack the website. I've seen this before and so I suggested that they block the whole IP C-block. Which turned into the question of: how do you block IP addresses?
The way I do it is using PHP and it uses a nice little trick that few seem to know about. This is how:
PHP has a feature that allows you to pre-pend a file at every PHP request. This prepend file is the equivalent of having it include()ed at the top of every single PHP script on your site. It is done through a directive that is set either in php.ini or .htaccess. The directive is called auto_prepend_file. For .htaccess, this is what I use:
php_value auto_prepend_file "/full/path/to/a/prepend-file.php"
Because it runs at every PHP request and it runs before the actual requested script, you can do some really neat things. So what do I do? I'm developing this system internally and at the moment it does three things:
The blocking works as follows: there is a special directory where I put empty files that dictate the blocking. The file names are of two formats: a.b.c.d or a.b.c, depending on whether I want to block a specific IP address (the former format) or a C-block (the latter). In the pre-pend file, there is a simple check: figure out the remote IP address, and check for the presence of either its file or its C-block file. So if the remote IP is 111.222.333.444, it checks for the presence of either /111.222.333 or /111.222.333.444. If either exists, a 403 Forbidden header is returned and the code exit()s, so no actual content gets displayed.
This raises the question: how do you add files to the directory? Using a web interface, of course.
You can do it with a simple touch() or an fopen().
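To make the check concrete, here is the same logic sketched in Python rather than PHP. It is an illustration only, not my actual prepend file: the function names are made up, and the blocking directory is whatever path you pass in.

```python
import ipaddress
from pathlib import Path

def marker_names(remote_ip):
    """File names to look for: the C-block (a.b.c) and the full IP (a.b.c.d)."""
    # Validate first so odd input can never be used as a file name.
    ipaddress.IPv4Address(remote_ip)
    octets = remote_ip.split(".")
    return [".".join(octets[:3]), remote_ip]

def is_blocked(remote_ip, block_dir):
    """True if a marker file exists for the IP or for its C-block."""
    return any((Path(block_dir) / name).exists() for name in marker_names(remote_ip))

def block(target, block_dir):
    """Create an empty marker file, like touch() in PHP."""
    directory = Path(block_dir)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / target).touch()
```

In a real prepend file the equivalent of is_blocked() would run first thing, and a positive result would send the 403 and stop the script.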
For completeness, there is a sister directive called auto_append_file which runs after each PHP script is called (with the exception that if the script exit()s, the append file doesn't run). I've never used it, but it can be useful for things like measuring how quickly scripts run on your server.
Today is a great day to be... not Google. As someone who's spoken out many times against Google and its practices, I'm very happy today. This is not a simple ranting post, so please bear with me as I explain away the smile. Two reasons to be smiling:
First the PR update. There is no evidence to suggest that this is due to link selling. There is no evidence to suggest it's a vendetta against people who've spoken against Big G. It could be a glitch, but my favorite theory: ALL toolbar PR will drop to zero. Why? The publicly visible PR has always been a thorn in Google's side: everyone watches it, people use it to assert authority, people try to manipulate it, and worst of all, it created a market because it's in finite supply.
I also think the visible PR kick-started a mentality of not linking. When it started, it was "I don't want PR to leak" and so people said "leak it for $$$". People saw that this was working well as a money source and worth it as an investment for traffic, PR, and rankings. So Google responded, half-heartedly, by delaying the publication of the true PR by a few months. That was annoying but it worked. What it also meant was that on some (most?) valuable sites, getting a link was virtually impossible without forking over some cash.
Next came the mangling of nofollow. It first started as this innocent anti-spam measure. Yeah, right. As if that was ever going to deter spammers (just ask the email spam filter companies). If anything, that made spammers get more creative and we're still happily getting spammed. So now we had this kinda useless tool until some genius figured it out: it can manage the flow of PR in a site. Heck, I can now even link out to other sites I otherwise wouldn't have because I can tell Google to ignore the link. Think about that: I am actively linking to a website while at the same time sending a message to Google that I don't trust said site. Hypocrisy at its finest, all thanks to Google.
As you can imagine, this is not a stable situation. Eventually, a drastic measure would be needed to fix it. The most obvious one? Kill the visible PR. Take out the symptoms and the disease. What happens if you make all PR in the world zero? It will become useless. You take away the commodity that's being traded. And hopefully, you'll save the net from the mess you created by freeing people from worries about linking to each other again. The corollary to that is nuking nofollow, which I honestly believe is also required. However, let me be the first to note there is no evidence of this happening. None.
So let us for a moment assume that today's PR update (which hit eKstreme.com, by the way) is not a glitch. Let's assume it's a planned move, which would include a PR algo update. What does it mean?
Anyway, I've rambled on this too much. On to the Google-Microsoft-Facebook love triangle.
Microsoft was always the most likely winner since they already had a partnership with Facebook. As TechCrunch put it, it is the path of least resistance. Also, we should note that this was never about the money: it's a political win and a winning of mindshare. Also, whoever won would officially become the most spying, prying, ad agency on the web - in the world! Google's stated goal is to target advertising based on search history and other personal data. They call it 'personalization'. Everyone kicked up a fuss about how naughty this is, so much so, that when Google wanted to buy DoubleClick, the cries became louder: Google has too much creepy oversight over us.
Still, whatever info Google had, it's nothing compared to what Facebook has - in relation, Google's knowledge of me is harmless compared to what I have in my FB profile. Imagine an ad agency: would you like to target 25-30 year old males in a relationship with a woman and having liberal political and religious views? Facebook will say "Sure! We got some of those!" Google has nothing to say about that.
The level of targeting can be ridiculous: wanna target people who may have missed someone's birthday? How about those who were recently in a relationship but are not anymore? What about those who just entered a relationship? How about accurate geotargeting? All this info and much more is readily available through Facebook, and now Microsoft has access.
So all in all, it's an exciting and eventful day, but looking at the potential privacy worries, it doesn't bode well for the future. Whatever happens, today is very likely to go down in history as a turning point for both Google and Microsoft.
While I'm rambling, I'll finish with some predictions:
All thoughts welcome below
Yesterday, I posted my 2000th post on Cre8asite Forums where I moderate. The post was about writing tools (again!) but this time with a twist: the tool is released under a BSD license, meaning it can be downloaded and used as you see fit. The project, called Open Keyword, is a keyword generation tool for SEO purposes.
Open Keyword has two components:
There are two key things about Open Keyword that make it unique:
And since Open Keyword is open source, anyone can write improvements. If you do, please share them with me and I'll incorporate them so that everyone benefits.
So all you have to do now is go to the Open Keyword home page and download it. Full instructions on setting it up are also on the page. And if you get stuck gimme a shout. Comments and thoughts below please
I've been thinking about using Arabic in URLs, a question asked by Rand Fishkin of SEOmoz over at Cre8. Rand's question was:
What if you are optimizing in the Arabic character language set and want to include "keywords" in your URL
As an Arabic speaker and user of Arabic websites, I feel I can help answer this one. The answer is applicable to other languages as it deals with technical issues faced by all non-English languages. Arabic is merely the language we draw specific examples from. So here goes...
URLs are allowed only a certain set of characters for them to work: the English alphabet (both lowercase and uppercase), the numbers, dashes, dots, forward slashes, the question mark, and a few others. These chosen few characters are based on American English as defined by the ASCII standard for historical reasons. All other characters, like some English punctuation and non-English characters, have to be encoded.
The question needs to be answered for domain names too. Wikipedia has a nice summary of international domain names that allow non-ASCII characters in them. However, support for that is not universal yet, and as we'll see later, different browsers will handle international domain names differently. For now, I would recommend steering clear of these for SEO purposes.
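To see what that encoding looks like in practice, here is a small Python sketch. The word is just an illustration (the Arabic for "book", written with Unicode escapes so the example survives any editor), and example.com is a placeholder; the sketch assumes UTF-8 percent-encoding, which is what quote() does by default.

```python
from urllib.parse import quote, unquote

# Four Arabic letters (kaf, ta, alif, ba), written as escapes.
arabic_word = "\u0643\u062a\u0627\u0628"

# Each letter becomes two UTF-8 bytes, and each byte becomes a %XX triplet.
encoded = quote(arabic_word)
url = "http://example.com/" + encoded
print(url)

# Decoding recovers the original word.
decoded = unquote(encoded)
print(decoded == arabic_word)
```

A four-letter word turns into 24 characters of percent signs and hex digits, which is why encoded keywords make for such unfriendly URLs.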
OK, so we know that non-ASCII characters have to be encoded, and so what about Rand's question about keywords in the URL? This raises a very interesting question: If you know the URLs are going to be encoded, doesn't usability dictate that you use non-encoded text? So given the choice between these two URLs:
which one would you choose? Is the presence of the word in the URL really that key for ranking?
I would argue that in this case, for usability's sake, I would go for something like:
Then the actual page contents will be in Arabic or any other language. The anchor text is also key, so in-site optimization becomes super-critical, not to mention on-page techniques.
The other thing to consider is how users input URLs. Do they type them? How important is type-in traffic for the site in question? Most likely, people will type the domain name in English. Speaking of which, try this site: search that points to an Arabic domain name (see PS below as to why I'm linking to Google to give the domain name) and watch the URL in different browsers. Safari keeps the URL as it is, but Firefox and IE 7 change it to http://xn--ugb6bax.com/. That last URL is certainly not memory-friendly.
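The xn-- form is the ASCII version of an international domain name. As a rough illustration of the conversion browsers do behind the scenes, Python's built-in "idna" codec can go both ways. The label below is an arbitrary three-letter Arabic word chosen for the example, not the domain mentioned above.

```python
# Three Arabic letters (mim, sad, ra), written as escapes.
label = "\u0645\u0635\u0631"

# Unicode label -> ASCII form, as it travels over the wire.
ascii_label = label.encode("idna")
print(ascii_label)

# ASCII form -> Unicode label, as some browsers display it.
round_trip = ascii_label.decode("idna")
print(round_trip == label)
```

Note that Python's built-in codec implements the older IDNA 2003 rules, so treat this as a sketch of the idea rather than a reference for what every browser does.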
So to sum up: I would be careful how I use non-English letters in URLs.
Back in May, Google came out with the all-singing-all-dancing-Ask.com-copycat called Universal Search. Buried in the announcements is a little gem called Cross-Language Information Retrieval (also see this). From SEL's post:
Search queries will be entered in the native language, translated into English and run against Google's index. Any retrieved pages/sites will then be translated from English back into the native language.
I'm sure this will affect the kind of SEO we're talking about here, but I haven't done any tests to see how and how much. Anyone got data?
Another thing you will notice is anglicized Arabic (Arabic written in English): sometimes you'll see numbers in the middle of the words. This is because there is a colloquial transliteration system developed over the past decade or so (thanks to mobile phones and the internet!) to write Arabic-only letters in English. Example: Arabic has two H-like characters. There is one pronounced like the H in Henry and a deeper one, sounding almost like the H you would make with a scratchy throat. The Henry-like H is transliterated into H in English, and the second type of H is transliterated into a 7.
How the search engines handle (parse and index) such transliterated text is a very big question. A quick search for [7mar] (donkey in Arabic :), which is used as an insult along the lines of stupid or moron) shows that quite a few pages are indexed with that "word" in it. Interestingly, Google thinks these pages are English, and if you do an Arabic search specifically, you get another set of SERPs suggesting that unless explicitly told it's Arabic, Google at least will get confused.
There is more to this story, as part of another bigger story, but that's for another post. In the meantime, please post questions and comments below
PS - Why isn't there a word of Arabic in this post? It's because WordPress thinks it's the best piece of software in the world and keeps editing my Arabic into question marks. Using either IE or FF does not fix this problem.