Google: Block Duplicate Content & Don't Block Duplicate Content

Apr 13, 2011 - 8:16 am

A very active SEO posted a thread at the Google Webmaster Help forums asking why Google seems to contradict itself in its advice on how to handle duplicate content.

He points out two different help documents:

(1) Google-friendly sites #40349:

Don't create multiple copies of a page under different URLs. Many sites offer text-only or printer-friendly versions of pages that contain the same content as the corresponding graphic-rich pages. To ensure that your preferred page is included in our search results, you'll need to block duplicates from our spiders using a robots.txt file.

(2) Duplicate content #66359:

Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the rel="canonical" link element, the URL parameter handling tool, or 301 redirects.

So which is it? Do we block the duplicate content or not? With Panda rolling out globally and Google giving advice to remove duplicate content and non-original content, what is one to do?

New Googler Pierre Far passed this on to Googler JohnMu who replied:

Just to be clear -- using a robots.txt disallow is not a recommended way to handle duplicate content. By using a disallow, we won't be able to recognize that it's duplicate content and may end up indexing the URL without having crawled it.

For example, assuming you have the same content at:
A) http://example.com/page.php?id=12
B) http://example.com/easter/eggs.htm

... and assuming your robots.txt file contains:
user-agent: *
disallow: /*?

... that would disallow us from crawling URL (A) above. However, doing that would block us from being able to recognize that the two URLs are actually showing the same content. In case we find links going to (A), it's possible that we'll still choose to index (A) (without having crawled it), and those links will end up counting for a URL that is basically unknown.
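To see why the rule above catches URL (A) but not URL (B), here is a minimal Python sketch of wildcard matching for Disallow rules. It assumes Google-style semantics, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names `rule_to_regex` and `is_disallowed` are hypothetical, and the sketch ignores Allow rules, user-agent groups, and rule precedence, so it is an illustration rather than a full robots.txt parser.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one Disallow path rule into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing
    '$' anchors the end, and everything else is a literal prefix.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(url: str, disallow_rules: list[str]) -> bool:
    """Return True if any Disallow rule matches the URL's path and query."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # Empty rules are skipped; re.match anchors at the start of the path.
    return any(
        rule_to_regex(rule).match(target) for rule in disallow_rules if rule
    )


rules = ["/*?"]
print(is_disallowed("http://example.com/page.php?id=12", rules))
print(is_disallowed("http://example.com/easter/eggs.htm", rules))
```

Under these assumptions, only the URL containing a query string matches the rule, so the crawler never fetches (A) and cannot compare its content with (B).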

On the other hand, if we're allowed to crawl URL (A), then our systems will generally be able to recognize that these URLs are showing the same content, and will be able to forward context and information (such as the links) about one URL to the version that's indexed. Additionally, you can use the various canonicalization methods to make sure that we index the version that you prefer.
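As a rough illustration of two of the canonicalization methods mentioned in the help document (the rel="canonical" link element and 301 redirects), the Python sketch below builds the link element for a preferred URL and looks up a permanent redirect for a known duplicate. The `PREFERRED` mapping and the function names are hypothetical, and which URL should be treated as the preferred one is an assumption made for the example.

```python
from html import escape

# Hypothetical mapping from a duplicate path (with query) to the preferred URL.
PREFERRED = {
    "/page.php?id=12": "http://example.com/easter/eggs.htm",
}


def canonical_link(preferred_url: str) -> str:
    """Build the link element to place in the <head> of a duplicate page."""
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


def redirect_for(path_and_query: str):
    """Return (status, headers) for a known duplicate, or None otherwise."""
    target = PREFERRED.get(path_and_query)
    if target is None:
        return None
    # 301 signals a permanent move to the preferred URL.
    return 301, {"Location": target}


print(canonical_link(PREFERRED["/page.php?id=12"]))
print(redirect_for("/page.php?id=12"))
```

Either approach keeps the duplicate URL crawlable, which is what lets the systems described above connect it to the preferred version.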

But is Google still contradicting itself in those two help articles?

Forum discussion at Google Webmaster Help.

Image credit to JUNG HEE PARK on Flickr.

 
