Google: Block Duplicate Content & Don't Block Duplicate Content

Apr 13, 2011 - 8:16 am

A very active SEO posted a thread at the Google Webmaster Help forums asking why Google seems to contradict itself in its advice on how to handle duplicate content.

He points out two different help documents:

(1) Google-friendly sites #40349:

Don't create multiple copies of a page under different URLs. Many sites offer text-only or printer-friendly versions of pages that contain the same content as the corresponding graphic-rich pages. To ensure that your preferred page is included in our search results, you'll need to block duplicates from our spiders using a robots.txt file.

(2) Duplicate content #66359:

Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects.
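
To make the two approaches concrete, here is roughly what each one looks like; the paths and URLs below are purely illustrative and not taken from either help document. The older advice boils down to a robots.txt block of the duplicate (say, a printer-friendly version):

User-agent: *
Disallow: /print/

The newer advice is to leave the duplicate crawlable and instead declare the preferred URL, either with a rel="canonical" link element in the <head> of the duplicate page or with a 301 redirect (shown here as an Apache mod_alias rule):

<link rel="canonical" href="http://www.example.com/page.html" />

Redirect 301 /print/page.html http://www.example.com/page.html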

So which is it? Do we block the duplicate content or not? With Panda rolling out globally and Google giving advice to remove duplicate content and non-original content, what is one to do?

New Googler Pierre Far passed this on to Googler JohnMu, who replied:

Just to be clear -- using a robots.txt disallow is not a recommended way to handle duplicate content. By using a disallow, we won't be able to recognize that it's duplicate content and may end up indexing the URL without having crawled it.

For example, assuming you have the same content at:

A) http://example.com/page.php?id=12
B) http://example.com/easter/eggs.htm

... and assuming your robots.txt file contains:

user-agent: *
disallow: /*?

... that would disallow us from crawling URL (A) above. However, doing that would block us from being able to recognize that the two URLs are actually showing the same content. In case we find links going to (A), it's possible that we'll still choose to index (A) (without having crawled it), and those links will end up counting for a URL that is basically unknown.

On the other hand, if we're allowed to crawl URL (A), then our systems will generally be able to recognize that these URLs are showing the same content, and will be able to forward context and information (such as the links) about one URL to the version that's indexed. Additionally, you can use the various canonicalization methods to make sure that we index the version that you prefer.
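
Applied to JohnMu's example, that presumably means dropping the disallow: /*? rule so URL (A) stays crawlable, and adding a canonical link in the <head> of the page served at (A) pointing to the preferred URL (B), along these lines:

<link rel="canonical" href="http://example.com/easter/eggs.htm" />

Google can then crawl both URLs, recognize them as duplicates, and consolidate the links and indexing signals on the eggs.htm version.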

But is Google still contradicting itself in those two help articles?

Forum discussion at Google Webmaster Help.

Image credit to JUNG HEE PARK on Flickr.

 
