Duplicate Content: How to solve the problem
In the previous article, “Duplicate Content: the effects on Search Engine Rankings”, we explained what duplicate content is and analyzed how it can affect the search engine rankings of a website. In this article we focus on the best practices for solving the duplicate content problem and examine the technical aspects of the issue.
What is the root of the duplicate content problem?
As we saw in the previous post, duplicate content comes from submitting the same content multiple times to different pages or websites, from using multiple domain names incorrectly, and from using incorrect web development and SEO practices.
In the first case, the problem is usually caused by webmasters who try to promote their websites by posting the same articles, texts or press releases on multiple websites. It can also be the result of an incorrect link building strategy in which SEOs try to increase the number of backlinks by submitting the same content to multiple sources. In this case, the root of the duplicate content problem is the user who tries to promote a website with grayhat or blackhat techniques.
In the second case, the problem is caused by companies that acquire and use multiple domain names for the same website. For example, if example.com and example.co.uk both serve the same website without a redirect, duplicate content issues are certain to appear. Again, the root of the problem is the webmaster or the web development company that does not know how to set up 301 redirections correctly and does not follow best web practices.
The third case is much more interesting and technical. The root of the problem is that the HTTP protocol does not provide a standard way to identify the “best” URL of a particular page. This means that one page can be accessed by multiple valid URL addresses and at the same time no information is available about the canonical URL.
For example, http://www.example.com and http://www.example.com/index.html could lead to the same page, but the HTTP protocol neither points out the “best” one nor guarantees that both addresses lead to the same page. They could lead either to the same page or to 2 completely different pages.
We also need to keep in mind that there are many different languages (PHP, ASP.NET, ASP, JSP, CGI, ColdFusion, etc) and web technologies that can be used to support dynamic websites. Because these technologies support different features (default vs index pages, case sensitive vs case insensitive URLs etc), the situation gets more complicated.
All the above difficulties make it easy for someone who does not understand how search engines work to make mistakes in the link structure of the website and hurt the SEO campaign. So the question is: how can we avoid those mistakes?
How to solve the duplicate content problem
In the first case, the solution is relatively easy. All you need to do is avoid submitting the exact same content to multiple sources and always use whitehat SEO techniques. Prepare different versions of the same article or press release so that search engines will not consider them duplicates. This will help you build better links and generate more traffic.
In the second case, when you need to acquire multiple domain names for the same website, make sure you select only one primary domain and set up HTTP 301 redirections for the rest. Say, for example, that a particular website uses 2 domain names: example.com (primary) and example.co.uk (secondary). What you want is a 301 redirection from example.co.uk to example.com, so that whenever someone types the secondary domain he/she is redirected to the primary one. There are several ways to do this (the domain forwarding option of your domain provider, .htaccess, PHP/ASP/JSP redirection etc). The most straightforward is usually the forwarding option in the panel of your domain provider. Keep in mind that DNS records on their own only map a name to a server and cannot issue an HTTP 301, so the redirect itself always has to be answered by a web server.
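As an illustration of a programmatic redirection, here is a minimal sketch written as a Python WSGI application. It assumes that example.com is the primary domain, that the site is served over plain http, and that every other host name reaching the application should be redirected; the function name and the placeholder response body are hypothetical.

```python
# Assumption: example.com is the primary domain and the site uses plain http.
PRIMARY_HOST = "example.com"


def redirect_to_primary(environ, start_response):
    """WSGI app that answers requests for secondary domains with a 301."""
    # Strip an optional port and ignore case when comparing host names.
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()

    if host != PRIMARY_HOST:
        # Keep the requested path and query string so deep links survive.
        location = "http://" + PRIMARY_HOST + (environ.get("PATH_INFO") or "/")
        query = environ.get("QUERY_STRING", "")
        if query:
            location += "?" + query
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]

    # Placeholder for the real site served on the primary domain.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"primary site"]
```

With this sketch, a request for http://example.co.uk/about would be answered with a 301 whose Location header is http://example.com/about, while requests for example.com itself are served normally.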
The third case is a bit more complicated. As we said in the previous article, search engines do take steps to minimize the effect of the duplicate content problem by identifying the best URL for a particular page. They apply a set of rules to the URLs in order to identify the best possible version (for example, the trailing / is added after “.com”, the domain name is converted to lowercase, and they determine whether to use the www or the non-www version). After that they are forced to visit the different URLs and analyze the pages in order to determine whether they are duplicate or unique. Nevertheless, even though search engines try to solve the issue, the SEO campaign can still be affected, so it is highly recommended to work on your link structure in order to eliminate those problems.
Working on your link structure
So what you want to do is make sure that all the links of your site point to the best URLs and that there are no situations where 2 different URLs lead to the same page.
Here is a list of the most important things you should look out for:
- Remove all the URL variables that are not important from all the links (SESSIONIDs, empty variables etc).
- Decide whether to use the www or the non-www version for your site and place a 301 redirection from the other version to the one you chose.
- Decide whether to use the “index.html” in the URL when you point to the default page of a folder, and use that form consistently.
- Add the trailing / at the end of each folder.
There is a great article by Ian Ring on this subject so I am not going to discuss it further. Make sure you read his article “Be a Normalizer – a C14N Exterminator” and also the Wikipedia article on URL normalization. All these tips can help you optimize the link structure of your website, which in turn helps you solve the major duplicate content problems.
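The checklist above can be sketched as a small Python function. This is only a minimal sketch under stated assumptions: the preferred host www.example.com, the set of ignored parameter names, the default document name and the rule that a last path segment without a dot is a folder are all illustrative choices, not rules taken from the articles mentioned above or from any search engine.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# All four constants are assumptions made for this example.
PREFERRED_HOST = "www.example.com"
ALTERNATE_HOSTS = {"example.com"}
IGNORED_PARAMS = {"sessionid", "phpsessid", "sid"}
DEFAULT_DOCUMENTS = {"index.html"}


def normalize_url(url):
    """Return the single form of `url` that internal links should use."""
    parts = urlsplit(url)

    # Use one host name only (www vs non-www), always in lowercase.
    host = (parts.hostname or "").lower()
    if host in ALTERNATE_HOSTS:
        host = PREFERRED_HOST
    if parts.port:
        host = f"{host}:{parts.port}"

    # Drop the default document and add the trailing / to folders.
    path = parts.path or "/"
    folder, _, last = path.rpartition("/")
    if last in DEFAULT_DOCUMENTS:
        path = folder + "/"
    elif last and "." not in last:
        # Assumption: a last segment without a dot names a folder.
        path = path + "/"

    # parse_qsl skips empty variables by default; session IDs are filtered.
    kept = [(key, value) for key, value in parse_qsl(parts.query)
            if key.lower() not in IGNORED_PARAMS]

    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))
```

For example, normalize_url("http://Example.com/index.html?SESSIONID=abc&item=5&empty=") returns "http://www.example.com/?item=5", and normalize_url("http://www.example.com/shop") returns "http://www.example.com/shop/". Running every internal link through one such function is a simple way to keep the link structure consistent.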
Another great way to solve the problem is by using 301 redirections. Especially in cases where the incoming links of a particular page are divided between the various duplicate versions of the page, it is highly recommended to use the above rules to select the “best” URL and then set up 301 redirections from the rest of the URLs to it. This can be done easily by using either the .htaccess file or a programming solution (for example a PHP redirection).
When working on the link structure of the website is not an option, there is an alternative called the Canonical Tag. The Canonical Tag was proposed by Google in order to resolve the duplicate content issue.
To be precise, it is not a tag but a value of the rel attribute of the <link> tag:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
It is placed in the <head> section of the HTML page in order to notify the search engines about the best URL for the particular page. Using the canonical in your pages is very useful and can drastically reduce the amount of duplicate content within your site. Additionally it is a great way to pass the link juice that is lost to the duplicates back to the canonical pages.
Keep in mind that this tag is only a hint, not a directive, for the major search engines. In order to use it properly, the pages behind the canonical URL and the duplicate URLs must have identical or almost identical content.
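To check which canonical URL a page actually declares, the markup can be inspected with a few lines of Python. The sketch below uses only the standard library; the class and function names are hypothetical and it simply reports the first <link> element whose rel attribute contains "canonical", without judging whether a search engine would honor it.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values in any letter case.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Fed a page whose head contains the link element shown earlier, find_canonical returns "http://www.example.com/product.php?item=swedish-fish". Running it against every duplicate version of a page is a quick way to confirm that they all point to the same canonical URL.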
For more information about Canonical URLs, check out the article by Matt Cutts, “SEO advice: url canonicalization”, and the article on the official Google Blog, “Specify your canonical”.
The best methods to solve the duplicate content problem
As we said above, there are several ways to solve the problem. Here is the list of the methods that you should use, in order of preference (note that it is highly recommended to try solving the problem with the first 3 methods):
- Work on your link structure
- Use 301 redirections
- Use the canonical tag
- Exclude parameters such as session IDs by using the Google Webmaster Tools console
- As a last resort, block the duplicate content with robots.txt
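For the last resort, it helps to verify that the robots.txt rules block the duplicates and nothing else. The following minimal sketch uses Python's standard urllib.robotparser module; the /print/ folder is a hypothetical location for duplicate printer-friendly pages and the function name is an assumption made for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the duplicate (printer-friendly) pages live under /print/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, user_agent="*"):
    """Return True if the given rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the lines of the file, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, is_crawlable("http://www.example.com/print/article.html") returns False while is_crawlable("http://www.example.com/article.html") returns True, so only the duplicate version is blocked.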