I've been thinking about using Arabic in URLs, a question asked by Rand Fishkin of SEOmoz over at Cre8. Rand's question was:
What if you are optimizing in the Arabic character language set and want to include "keywords" in your URL
As an Arabic speaker and user of Arabic websites, I feel I can help answer this one. The answer applies to other languages too, as it deals with technical issues faced by all non-English languages. Arabic is merely the language we draw specific examples from. So here goes...
Talking in (en)code
URLs can only contain a limited set of characters if they are to work: the English alphabet (both lowercase and uppercase), digits, dashes, dots, forward slashes, the question mark, and a few others. For historical reasons, these chosen few characters are based on American English as defined by the ASCII standard. All other characters, like some English punctuation and all non-English characters, have to be encoded.
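To make the encoding step concrete, here is a minimal Python sketch using the standard library's `urllib.parse`. It assumes the text is encoded as UTF-8 (the library's default) and uses the Arabic word for "directory", written with Unicode escapes so that it survives editors that mangle non-ASCII text. The host `site.com` is only a placeholder.

```python
from urllib.parse import quote, unquote

# The Arabic word for "directory" (four letters), written as Unicode
# escapes so the source file itself stays pure ASCII.
word = "\u062f\u0644\u064a\u0644"

# quote() encodes the text as UTF-8 and writes each byte as %XX.
encoded = quote(word)
print(encoded)  # %D8%AF%D9%84%D9%8A%D9%84

# unquote() reverses the process and recovers the original word.
decoded = unquote(encoded)
print(decoded == word)  # True

# Placeholder URL showing where the encoded keyword would end up.
url = "site.com/?page=" + encoded
print(url)
```

Each Arabic letter takes two bytes in UTF-8, so a four-letter keyword turns into 24 characters of percent-encoded text.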
The question needs to be answered for domain names too. Wikipedia has a nice summary of international domain names, which allow non-ASCII characters. However, support for them is not universal yet, and as we'll see later, different browsers handle international domain names differently. For now, I would recommend steering clear of these for SEO purposes.
Usability trumps the day?
OK, so we know that non-ASCII characters have to be encoded. So what about Rand's question about keywords in the URL? This raises a very interesting question: if you know the URLs are going to be encoded, doesn't usability dictate that you use text that needs no encoding? So given the choice between these two URLs:
- site.com/?page=%D8%AF%D9%84%D9%8A%D9%84
- site.com/?page=directory (Rand's question was about a directory as in DMOZ not as in folder)
which one would you choose? Is the presence of the word in the URL really that key for ranking?
I would argue that in this case, for usability's sake, I would go for something like:
- site.com/directory
- site.com/node/1
Then the actual page contents will be in Arabic or any other language. The anchor text is also key, so in-site optimization becomes super-critical, not to mention on-page techniques.
International domain names
The other thing to consider is how users input URLs. Do they type them? How important is type-in traffic for the site in question? Most likely, people will type the domain name in English. Speaking of which, try this site: search that points to an Arabic domain name (see PS below as to why I'm linking to Google to give the domain name) and watch the URL in different browsers. Safari keeps the URL as it is, but Firefox and IE 7 change it to http://xn--ugb6bax.com/. That last URL is certainly not memory-friendly.
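The `xn--` form that Firefox and IE 7 show is the ASCII-compatible (Punycode) encoding of the Arabic domain name. As a rough sketch, Python's built-in `idna` codec can convert between the two forms. Note that this codec implements the older IDNA 2003 rules, so treat it as an illustration rather than a statement of what any particular browser does. The output is printed as code points to keep it ASCII-only.

```python
# Domain from the post, in the ASCII form that Firefox and IE 7 display.
ascii_domain = "xn--ugb6bax.com"

# Python's built-in "idna" codec (IDNA 2003) converts to the Unicode form.
unicode_domain = ascii_domain.encode("ascii").decode("idna")

# Show the code points of the first label instead of raw Arabic text.
label = unicode_domain.split(".")[0]
code_points = [f"U+{ord(ch):04X}" for ch in label]
print(code_points)  # ['U+062F', 'U+0644', 'U+064A', 'U+0644']

# Encoding the Unicode form again reproduces the xn-- form.
round_trip = unicode_domain.encode("idna")
print(round_trip)  # b'xn--ugb6bax.com'
```

Those four code points spell the same word for "directory" that appears percent-encoded in the `?page=` example earlier, so the same keyword ends up looking completely different depending on whether it sits in the domain or in the rest of the URL.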
So to sum up: I would be careful how I use non-English letters in URLs.
What about CLIR?
Back in May, Google came out with the all-singing-all-dancing-Ask.com-copycat called Universal Search. Buried in the announcements is a little gem called Cross-Language Information Retrieval (also see this). From SEL's post:
Search queries will be entered in the native language, translated into English and run against Google's index. Any retrieved pages/sites will then be translated from English back into the native language.
I'm sure this will affect the kind of SEO we're talking about here, but I haven't done any tests to see how and how much. Anyone got data?
Anglicized Arabic
Another thing you will notice is anglicized Arabic (Arabic written in English letters): sometimes you'll see numbers in the middle of words. This is because a colloquial transliteration system has developed over the past decade or so (thanks to mobile phones and the internet!) to write Arabic-only letters in English. Example: Arabic has two H-like characters. One is pronounced like the H in Henry, and the other is deeper, sounding almost like the H you would make with a scratchy throat. The Henry-like H is transliterated as H in English, and the second type of H is transliterated as a 7.
How the search engines handle (parse and index) such transliterated text is a very big question. A quick search for [7mar] (donkey in Arabic :), which is used as an insult along the lines of stupid or moron) shows that quite a few pages are indexed with that "word" in them. Interestingly, Google thinks these pages are English, and if you do an Arabic-specific search, you get a different set of SERPs. That suggests that unless it is explicitly told the text is Arabic, Google at least will get confused.
There is more to this story, as part of another bigger story, but that's for another post. In the meantime, please post questions and comments below.
PS - Why isn't there a word of Arabic in this post? It's because WordPress thinks it's the best piece of software in the world and keeps editing my Arabic into question marks. Using either IE or FF does not fix this problem.