Duplicate Content: The Ultimate Guide

Not ranking high in Google? Chances are it's duplicate content…
Contrary to what most people think, the most common cause of websites not ranking is duplicate content. Nearly every website I do SEO for has some kind of duplicate content problem. It is so common that nearly all SEOs have difficulties with it at some point, whether it's missing something such as a duplicated website on a development server, or the client copying and pasting text. I should also admit, I come up against this problem nearly every time and sometimes I do miss it – I've yet to find a super tool for detecting duplicate content, so if anyone knows of one, please let me know in the comments.

Removing duplicate content is important because the effect can be huge. I see countless web pages having links built to them all the time, yet remaining stuck on page 3 or 4 – many people think that if they just get more links the page will go up, but this usually isn't the case.

The Most Common Causes Of Duplicate Content

This isn't a complete list – I'll no doubt add to it as time goes on. If you think of any other examples, please add them in the comments.

1. www vs non-www

Probably the most common problem is websites resolving both with and without the "www" at the start. For example, www.david-whitehouse.org and david-whitehouse.org showing the same website would likely cause a duplicate content problem. Instead, one should 301 redirect to the other, as in the sketch below.
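
If you're using Apache, a minimal .htaccess sketch for this would be something like the following (example.com is a placeholder – swap in your own domain and test before relying on it):

RewriteEngine On
# 301 any non-www request to the www version of the same URL
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]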

2. Multiple TLDs

Another common cause is two TLDs showing the same website, for example www.duplicatecontent.com and www.duplicatecontent.co.uk. Again, one should 301 redirect to the other (a catch-all redirect that handles this is sketched under point 3 below). This problem can be made even worse when combined with the "www" problem above.

3. Multiple Domain Names

Two separate domains both showing the same website – a great example of this would be www.theolddeanery.co.uk and www.theoldeanery.co.uk. If both domains serve the same website then you have problems, and again this can be made worse when combined with the "www" problem above. Instead, one domain should be chosen and all other domains should 301 redirect to it.
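
A catch-all version of the same .htaccess idea covers this and the multiple TLDs problem above in one go – anything that isn't the chosen domain gets 301 redirected to it (again, www.example.com is a placeholder and this is only a sketch):

RewriteEngine On
# 301 any host that isn't the canonical domain – covers spare TLDs,
# spare domains and the www/non-www problem all at once
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]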

4. Copy and Pasting

Copy and pasting – this is a real bugbear of mine, and it is hard to track down. It's one of the biggest reasons you should be careful when outsourcing content writing. If someone copies a phrase around eight words long or more from another website and puts it on your website, you're going to get penalised. If you copy content that isn't part of the template design from one part of your website to another, again you are going to get penalised. Don't copy and paste anything – always write from scratch.

5. Copy and Pasting… and Rewriting!

Copy and pasting, but rewriting – ah yes, you thought you could get round the duplicate content problem by copying and pasting what someone else wrote and then rewriting it. Genius. Google can determine synonyms, so if you have the exact same sentence structure but with synonyms swapped in, guess what – that's right, a duplicate content penalty. As I said above, don't copy and paste anything; write everything from scratch, no rewriting.

6. Poor Categorisation By Software

This is quite common. An example is WordPress: by default, the blog posts you write are displayed on the blog's front page, the category page, the tag page, the archive page and the blog post itself! This naturally causes problems, and the best solution is to follow Yoast's WordPress SEO guide.
Another example is Magento. I created a website in Magento last year for a friend of mine, and the problem arose when I noticed that an identical category could be displayed under a number of different URLs. Look at these two (Philips 4300k bulb, duplicate) for example. It has the rel canonical tag, but I'm not sure that alone solves the problem – I've made recommendations to get this adjusted.
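
For reference, the tag on every URL variant of a category page should point at one chosen version, duplicates included. Something like this (both the path and domain here are made up, loosely modelled on the bulb example):

<link rel="canonical" href="http://www.example.com/car-bulbs/philips-4300k-bulb.html"/>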

7. Poorly Implemented Search Engine Friendly URLs

Sometimes you get websites that implement search engine friendly URLs, but do it in a way that lets the same page be accessed from multiple URLs. The problem is that you can change the search engine friendly part to anything you like, as long as you keep part of it (usually the ID) the same – this naturally causes a duplicate content problem. Instead, there should be one default URL for each ID, and if the requested URL isn't the correct one it should 301 redirect to the original.

Incorrect Implementation
Original: http://www.electricfirestore.co.uk/mall/productpage.cfm/TheElectricFireStore/_WALLMO002/369091
Modified: http://www.electricfirestore.co.uk/mall/productpage.cfm/TheElectricFireStore/_WALLMO002/anyoldrubbish

See the difference in the URLs?

Correct Implementation
Original: http://www.wineguppy.com/rcroft-crystal-wine-decanter-s/64.htm
Modified: http://www.wineguppy.com/anyoldrubbish/64.htm

I still wouldn't recommend setting up search engine friendly URLs this way, but if you have to, this is the correct way of doing it. As you can see, this method still allows manipulation of the URLs, but anything modified 301 redirects back to the correct one.
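
If you're on Apache, one way of sketching that redirect is with a RewriteMap holding the canonical slug for each ID. Everything below is hypothetical – the map file, its path and the URL pattern (modelled on the wineguppy URLs above) – and note that RewriteMap has to go in the main server config, not .htaccess:

# httpd.conf – slugs.map holds one "ID canonical-slug" pair per line,
# e.g. "64 rcroft-crystal-wine-decanter-s"
RewriteEngine On
RewriteMap slugmap txt:/etc/apache2/slugs.map

# $1 = requested slug, $2 = numeric ID; if the requested slug doesn't
# match the canonical slug for that ID, 301 to the canonical URL
RewriteCond $1#${slugmap:$2|$1} !^(.+)#\1$
RewriteRule ^/?([^/]+)/([0-9]+)\.htm$ /${slugmap:$2|$1}/$2.htm [R=301,L]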

8. Development / Staging Servers

Often when your website is being built there is a development server that can also be seen by the public – this is a common cause of duplicate content. Instead, the server should only be accessible from certain IP addresses or should be password protected, or you can disallow all robots by putting this in the robots.txt file:

User-agent: *
Disallow: /
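
Alternatively, something like this in the staging site's .htaccess handles the IP and password options in one go (Apache 2.4 syntax; the IP address and file path are placeholders):

# Let the office IP straight in, ask everyone else for a password
AuthType Basic
AuthName "Staging server"
AuthUserFile /etc/apache2/.htpasswd
<RequireAny>
    Require ip 203.0.113.42
    Require valid-user
</RequireAny>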

9. Scraping & Feed Syndication

If you have a blog with an RSS feed, people can take that feed and use it to populate their own website, creating duplicate content. This can really cause problems. One way round it is to show only a partial feed, or to add a footer to each post in the RSS feed with a link back to your site. A great WordPress plugin for this is RSS Footer (again by Yoast).

How To Prevent Duplicate Content

1. 301 Redirects

Got one page accessible from two URLs? Simply redirect one of the URLs by adding a 301 redirect, usually done in your .htaccess file (if you're using Apache).

Here is the syntax for redirecting http://www.example.com/duplicate-page to http://www.example.com/original-page:

Redirect 301 /duplicate-page http://www.example.com/original-page

Or, for the more advanced user, you can do a .htaccess redirect with a regular expression (usually when moving your blog about).
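
For example, a hypothetical blog move from dated URLs to bare slugs could be handled with a single RedirectMatch line (domain and URL structure are made up):

# 301 /blog/2011/05/some-post/ style URLs to /some-post/
RedirectMatch 301 ^/blog/[0-9]{4}/[0-9]{2}/(.+)$ http://www.example.com/$1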

2. REL Canonical Tag

Realistically, all pages should have a rel canonical tag on them; the best place to learn about it is Google's own help topic.

Basically, it's a way of saying that this page isn't the original and pointing to where the original is – it helps prevent some duplicate content problems, particularly in situations where you can't 301 redirect something.

Example

Tag On http://www.example.com/original-page

<link rel="canonical" href="http://www.example.com/original-page"/>

Tag On http://www.example.com/duplicate-page

<link rel="canonical" href="http://www.example.com/original-page"/>

As you can see, this ensures Google knows that the duplicate page is actually the same as the original page, so it should not be counted as duplicate content.

Matt Cutts also has a video on the canonical tag that's worth watching to get a better understanding of how it works.

3. Robots Noindex Tag

Basically, this is a way of telling Google to keep a page out of its index. There are two ways of going about it.

A) Robots.txt
Strictly speaking, robots.txt blocks crawling rather than indexing, but it stops Google from reading the duplicate content at all. For example, if your duplicate page is located at www.example.com/duplicate-page, you would put the following in your robots.txt:

User-agent: *
Disallow: /duplicate-page

B) Meta Robots Meta Tag

<meta name="robots" content="noindex">

This should go in between the <head></head> tags on the relevant page.

4. Unique Content

I shouldn't really need to spell this out for you: don't copy anything! There are two things you can use to identify duplicate content. The first is Google itself – simply copy and paste a short sentence into a search and see how many results come up. The other is Copyscape (I typically use Google; I'm not sure how much you can rely on Copyscape, so use it at your own risk).

5. Choose A Half Decent Platform

To be honest, this is easier said than done, otherwise duplicate content wouldn't be such a major issue for so many websites. A lot of these problems can't be fixed on many software packages, or if they can be fixed it is normally a big pain in the ass – Magento springs to mind in particular. But it isn't just the free open source ecommerce packages that are bad: lots of blog software has the same problems, and a lot of the out-of-the-box ecommerce packages sold by smaller web development companies also cause duplicate content. This is why the problem is so widespread and why so many websites suffer from duplicate content penalties.

Well, I hope you found this guide useful. If I've missed anything or you want to add suggestions, please be sure to leave a comment, and if you find some good examples of duplicate content, feel free to add them here.