Have you ever noticed that when you type www.twitter.com in the address bar, the URL changes to http://twitter.com after Twitter loads? In the same way, if you type bbc.co.uk in the address bar, the URL becomes http://www.bbc.co.uk once the BBC site has loaded. If you browse to microsoft.com, you get an entirely different URL of the form "http://www.microsoft.com/en/us/default.aspx". So some websites strip off the 'www' prefix, while others add it. Yet another set of websites redirects you to entirely different-looking URLs.
In this post, we will look more closely at this aspect of search engine optimization, called URL canonicalization or URL normalization. We will see why it is important and how understanding it helps you reduce duplicate content issues.
Need for Canonicalization
As you have seen above, the following URLs may all be the same in terms of content:
http://website.com
http://www.website.com/index.html
http://www.website.com/default.aspx
http://website.com/default.aspx
But they are different URLs in the eyes of search bots. This means search engines will crawl each of them independently, expecting different content but fetching the same content every time. Your site ends up being crawled four times for the same content, and you can imagine the load that puts on your server. On top of that, your website may be treated as a copy (duplicate content) domain. Another effect is that each of these four variants will have its own PageRank in Google's index.
This is why we need to canonicalize URLs, which means standardizing on one URL structure for the entire domain.
Steps to Canonicalize URLs
Here are the steps you follow to canonicalize your URLs.
1. Choose WWW or Non-WWW form
The first step is, of course, choosing whether you want the WWW version or the non-WWW version for your URLs. If you are using a self-hosted blog, you can choose whichever you like. On the other hand, if you are hosting your blog on a free Blogger or WordPress domain, or on TypePad, then the non-WWW version is the one to choose. All the URLs in these services follow the non-WWW version by default, and you do not have access to the server to change it.
Once you have chosen, set up a 301 redirect from the non-preferred version to the preferred one. Server-specific setup of 301 redirects is beyond the scope of this post; for detailed instructions for various servers, see the related resources at the end. This does not apply to Blogger or hosted WordPress blogs, since you don't have server-level access there.
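To make the idea of the redirect concrete, here is a minimal sketch in Python, written as WSGI middleware. It is not a replacement for your server's own redirect configuration. The host name `website.com`, the choice of the non-WWW form, and the function name `redirect_to_preferred_host` are assumptions made only for this illustration, and the sketch ignores ports and `SCRIPT_NAME` for simplicity.

```python
from urllib.parse import quote

PREFERRED_HOST = "website.com"  # assumption: the non-WWW form was chosen


def redirect_to_preferred_host(app):
    """Wrap a WSGI app so that requests for any other host get a 301."""

    def wrapper(environ, start_response):
        host = environ.get("HTTP_HOST", "").lower()
        if host and host != PREFERRED_HOST:
            scheme = environ.get("wsgi.url_scheme", "http")
            # PATH_INFO arrives percent-decoded, so re-encode it for the header.
            path = quote(environ.get("PATH_INFO", "") or "/", encoding="latin-1")
            location = f"{scheme}://{PREFERRED_HOST}{path}"
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            headers = [("Location", location), ("Content-Length", "0")]
            start_response("301 Moved Permanently", headers)
            return [b""]
        # The request already uses the preferred host: serve it normally.
        return app(environ, start_response)

    return wrapper
```

With this wrapper in place, a request for www.website.com/about would be answered with a permanent redirect to website.com/about, while requests that already use the preferred host pass through untouched.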
Next, choose the preferred version in Google Webmaster Tools (this is not important if you have already set up the 301 redirect). Log into Webmaster Tools with your Google account, go to Settings (in the latest version, this section holds most of the settings), and set the preferred domain to the version you have chosen. This is enough for your Blogger/WordPress domain. Sadly, this feature is not yet available in Yahoo or Live Search.
2. Use the Canonical URL Everywhere in the Domain
Now you have to show the search engines that you prefer one version of the URL over the other. To do this, use the preferred version across your domain, and recommend that other people do the same. For instance, if you check my internal links, you will see that I always strip off the 'www' part. I recommend that people who link to my blog strip off the 'www' part as well. This makes it far more likely that search engines will pick up the preferred version and display it.
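One way to check your own pages for this is to scan the HTML for links that still use the non-preferred host. The following is a rough sketch using only the Python standard library; the host names and the helper `non_canonical_links` are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

OTHER_HOST = "www.website.com"  # hypothetical non-preferred variant


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def non_canonical_links(html):
    """Return the links in `html` that point at the non-preferred host."""
    collector = LinkCollector()
    collector.feed(html)
    # Relative links have no host and are skipped.
    return [url for url in collector.links
            if (urlsplit(url).hostname or "") == OTHER_HOST]
```

Any URL this returns is a candidate for rewriting to the preferred form.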
3. Remove Default File Names From URLs
Depending on your web server, you will have default file names (like 'index.html' or 'index.shtml' on Apache and 'default.aspx' on Microsoft IIS). This means that a visitor who comes to your site's root directory may actually be seeing "http://website.com/default.aspx".
You need to change this and strip the file name off the URL, so that the directory URL is the only one in use.
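As an illustration of the stripping itself, the sketch below removes a trailing default file name from a URL. The set of file names is taken from the examples above and is an assumption about your server's configuration; `strip_default_file` is a hypothetical helper, not part of any server or library.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these are the default documents configured on the server.
DEFAULT_FILES = {"index.html", "index.shtml", "default.aspx"}


def strip_default_file(url):
    """Drop a trailing default file name, leaving the directory URL."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    if filename in DEFAULT_FILES:
        parts = parts._replace(path=directory + "/")
    return urlunsplit(parts)
```

A function like this could be used when generating internal links, or to compute the target of a 301 redirect for requests that include the default file name.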
4. Minor Canonicalization Issues
There is more to canonicalization of URLs, such as keeping URLs entirely in lower case, keeping or dropping the trailing slash (http://website.com/ vs. http://website.com), removing fragments (#title), etc. More on this can be read in RFC 3986, the specification by Tim Berners-Lee (the inventor of the World Wide Web) and others, published by the Internet Engineering Task Force (IETF): http://tools.ietf.org/html/rfc3986.
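A few of these minor rules can be sketched in Python as follows. This is only an approximation under stated assumptions: it lowercases the scheme and host (which are case-insensitive), leaves the path's case alone because paths can be case-sensitive on some servers, treats an empty path as "/", and drops the fragment. The function name `normalize_minor` is made up for this example, and the sketch assumes the URL carries no user name or password.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_minor(url):
    """Apply a few small normalizations to an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()   # assumes no user:password@ prefix
    path = parts.path or "/"        # "http://website.com" -> "http://website.com/"
    # Passing an empty fragment removes any "#title" part.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Whether to lowercase paths as well, or to drop trailing slashes on deeper directories, is a policy decision for your own site; the important part is to apply the same rule everywhere.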
Conclusion
Make sure you read the related resources below to learn more about this. Keep a single URL structure across your domain, and encourage anyone linking to your site to follow it. This way, you will keep your website search-engine-friendly.
Related Resources
URL canonicalization FAQ at MattCutts.com
Setting up 301 redirects on various servers
Copyright © Lenin Nair 2008
Very informative. Keep it up. Thanks.
Very creative and educating.
How about sites, many many, that fail to load if you dont type www?--I mean just try http://heartstroke.ca/ (a north American sample) for test and try a site from any site that is outside india.com/ (a South Asian sample)
Best wishes.
Thanks for your comment. Both your site examples worked for me without www.
With IE or firefox?
In both IE and Fx, Mohamed.