Google: Block Duplicate Content & Don't Block Duplicate Content

Apr 13, 2011 • 8:16 am | comments (10) | Filed Under Google Search Engine Optimization
 

A very active SEO posted a thread in the Google Webmaster Help forums asking why Google seems to be contradicting itself in its advice on how to handle duplicate content.

He points out two different help documents:

(1) Google-friendly sites #40349:

Don't create multiple copies of a page under different URLs. Many sites offer text-only or printer-friendly versions of pages that contain the same content as the corresponding graphic-rich pages. To ensure that your preferred page is included in our search results, you'll need to block duplicates from our spiders using a robots.txt file.
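
For illustration, the sort of robots.txt rule that this older advice implies would look something like the lines below; the /print/ path is just a made-up example of a printer-friendly section, not something taken from either help document:

User-agent: *
# hypothetical printer-friendly section blocked from crawling
Disallow: /print/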

(2) Duplicate content #66359:

Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects.
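
As a rough sketch of the 301 redirect option mentioned there, on an Apache server with mod_alias you could send a duplicate URL to the preferred one from .htaccess along these lines (the hostname and paths here are hypothetical):

# send the printer-friendly copy to the preferred page
Redirect 301 /print/widgets.html http://www.example.com/widgets.html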

So which is it? Do we block the duplicate content or not? With Panda rolling out globally and Google giving advice to remove duplicate content and non-original content, what is one to do?

New Googler Pierre Far passed this on to Googler JohnMu, who replied:

Just to be clear -- using a robots.txt disallow is not a recommended way to handle duplicate content. By using a disallow, we won't be able to recognize that it's duplicate content and may end up indexing the URL without having crawled it.

For example, assuming you have the same content at:

A) http://example.com/page.php?id=12
B) http://example.com/easter/eggs.htm

... and assuming your robots.txt file contains:

user-agent: *
disallow: /*?

... that would disallow us from crawling URL (A) above. However, doing that would block us from being able to recognize that the two URLs are actually showing the same content. In case we find links going to (A), it's possible that we'll still choose to index (A) (without having crawled it), and those links will end up counting for a URL that is basically unknown.

On the other hand, if we're allowed to crawl URL (A), then our systems will generally be able to recognize that these URLs are showing the same content, and will be able to forward context and information (such as the links) about one URL to the version that's indexed. Additionally, you can use the various canonicalization methods to make sure that we index the version that you prefer.
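
Tying that back to John's example, the rel="canonical" approach would mean leaving URL (A) crawlable and adding a link element to its head that points at the version you prefer. If (B) were the preferred version, it would look like this:

<link rel="canonical" href="http://example.com/easter/eggs.htm">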

But is Google still contradicting itself in those two help articles?

Forum discussion at Google Webmaster Help.

Image credit to JUNG HEE PARK on Flickr.

 

Comments:

John Mueller

04/13/2011 12:29 pm

This is already fixed, by the way :-).

Barry Schwartz

04/13/2011 12:35 pm

You are quick!

googlemonopolyeu

04/13/2011 03:40 pm

This all leads to the inevitable answer that manual, webmaster-generated canonical URLs aren't necessary or even recommended by Google. Canonicals are touted by most SEOs as some clear indication of a site penalty, when in fact Google likely never truly utilized all that hard work people did. Canonicals were another half-baked, search-engine-created miracle solution that failed to deliver. Google suddenly is smart enough to organize and determine dupe content. That's a good start, a decade late.

Ben Pfeiffer

04/13/2011 04:46 pm

It should be noted that in Google's example they are using canonical tags on the pages. If you are not using the canonical tag on your pages and are just blocking duplicate or thin content, use robots.txt or a meta noindex. The usefulness of robots.txt is not dead just because Google says it is.
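
For anyone unfamiliar with the meta noindex Ben mentions, it is a tag placed in the head of the page you don't want indexed:

<meta name="robots" content="noindex">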

Michael Martinez

04/13/2011 04:58 pm

I think they are being necessarily complex, not contradictory. The complexity is necessary because they do have to see the document to understand fully what the Webmaster wants them to do with it. That's just an inefficiency built into the system. One possible solution to this issue might be for Google and Bing (and other search engines if they don't like duplicate content) to adopt an extension of the XML sitemap standard that allows the Webmaster to declare both duplicate and canonical URLs.
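
To make Michael's idea concrete, such a declaration might hypothetically look something like the snippet below. To be clear, this is purely an illustration of his suggestion; no such extension exists in the sitemaps.org protocol, and the dup: namespace is invented here:

<?xml version="1.0" encoding="UTF-8"?>
<!-- the dup: namespace is invented for illustration and is not part of the sitemaps.org protocol -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:dup="http://example.com/schemas/duplicate-urls">
  <url>
    <loc>http://example.com/easter/eggs.htm</loc>
    <!-- hypothetical: a duplicate URL the webmaster wants folded into the canonical one above -->
    <dup:duplicate>http://example.com/page.php?id=12</dup:duplicate>
  </url>
</urlset>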

Jamie Low

04/13/2011 07:46 pm

@John Mueller In light of this update, does Google have an official recommendation for dealing with PDF files?

John Mueller

04/14/2011 09:18 am

You can redirect the URLs of your PDF files to the preferred URL for your PDF files (assuming you have the same PDF with multiple URLs). If you have PDF files as alternate versions of the same content, then that wouldn't be seen as duplicate content -- it's quite different to have a PDF or an HTML page, so I'd recommend letting both be crawled & indexed. If you don't want the PDFs indexed, a simple solution could be to use the noindex x-robots-tag HTTP header when serving them.
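
For reference, the X-Robots-Tag John mentions is an HTTP response header sent with the PDF itself. On an Apache server with mod_headers enabled, one way to add it might look roughly like this:

<FilesMatch "\.pdf$">
Header set X-Robots-Tag "noindex"
</FilesMatch>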

John Mueller

04/14/2011 09:22 am

It's always good to use proper canonicalization for duplicate content additionally. Using a robots.txt disallow does not fix canonicalization; it just makes it impossible for search engines to handle it on their own (or through the normal methods like 301 redirects or the rel=canonical link element). This doesn't mean that the robots.txt is not useful at all, it's just not useful for canonicalization of duplicate content.

Luke Jones

04/14/2011 10:40 am

Google's documentation is all over the place. Over the past few weeks it seems I've been spending a lot of time looking at their help documentation, amongst other things, and there has been so much contradiction it's untrue. I'm hoping that at some point they will begin to reformat all of their help documents, removing any inconsistencies, but it's a really big ask for a company handling so many different languages.

Tiggerito

04/15/2011 02:33 pm

Think of this from another angle: say you have 10 pages that repeat the same content. Being identical means they are directly competing with each other in Google. Not an issue in itself. You start getting backlinks to them all, maybe 10 each. So you end up with ten pages with ten backlinks each, all targeting the same market. Wouldn't it be better to have one page with 100 backlinks? The first thing is not to duplicate if you don't have to. If you do, then 301 redirects or canonical tags can help you merge them. robots.txt... check your reference again, I think they read this post :-)
