Duplicate Content Issues

Feb 28, 2006 • 3:27 pm | comments (20) by twitter | Filed Under Search Engine Strategies 2006 New York
 

Moderated by Danny Sullivan – Search Engine Watch and organizer of the conference. Moved into large room…filling up nicely. Introduces topic and first speaker.

Anne Kennedy – Beyond Ink. “Double Trouble – How to Avoid Duplicate Content Penalties.” What it is, why it is a problem, how to spot it, and how to fix it. Will also focus on “inadvertent” duplicate content.

What is dupe content: multiple URLs with the same content…identical homepages with the same content. Why is it a problem? Because “they” say so. Recommends looking at the webmaster guidelines at Google, the Open Directory, and Yahoo. The real reason this is a problem is that you wind up confusing the SE robots.

Mirror sites: one website, two domains. Shows the example of rsdfoundation.org. Somebody in the academic computing center decided that “since SEs like .edu domains,” they should put the content live on the University of Florida site. Confusing the bot: two URLs. Links to multiple root domains from other sites, with inbound links pointing to different domains for the same site. Describes the real domain and “canonical” domain of a client of hers causing the whole site not to be listed.

Confusing the bot: dynamic URLs. As robots find dynamic content, the site may be returning a different URL with the same content…this is also a problem. Use the “repeat the search with omitted results included” feature to see this happening with some websites. Recommends using robots.txt exclusion and 301 redirects. 301 redirects: “your hero.” Server-side redirects to a single canonical domain. Test the page to make sure it works, and ensure you use 301s instead of 302s. Find code for this at beyondink.com/301redirect. You can also contact Google and use “reinclusion request” in the subject line to get help.
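The “single canonical domain” advice above can be sketched in code. This is a minimal, hypothetical example (the domain names and function are made up, not from the talk): every request for an alternate domain gets mapped to a 301 target on the canonical host, and requests already on the canonical host are served normally.

```python
# Hypothetical sketch of canonical-domain redirection. The hosts below are
# placeholders; in practice this logic lives in server config (e.g. a
# server-side 301 rule) rather than application code.
CANONICAL_HOST = "www.example.com"
ALTERNATE_HOSTS = {"example.com", "example.net", "mirror.example.org"}

def redirect_target(host, path):
    """Return the 301 Location for a non-canonical host, or None."""
    if host in ALTERNATE_HOSTS and host != CANONICAL_HOST:
        return "http://%s%s" % (CANONICAL_HOST, path)
    return None  # already canonical; serve the page normally
```

The key point from the talk survives the sketch: every variant resolves to exactly one URL, and the redirect status must be 301 (permanent), not 302.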

Shari Thurow – Grandtastic Designs. Will speak about the way some SEs filter out dupe content. Some ways include, but are not limited to: content properties, linkage properties, content evolution, etc…see below. Content properties: the SE looks for unique content by removing “boilerplate” such as navigation areas, etc. and analyzing the “good stuff.”

Dupe content filters: linkage properties. Looking at inbound and outbound links to determine if it is dupe content spam. The way they can determine it isn't is by seeing that the linkage properties are different for each site. Content evolution: in general, 65% of websites will not change info on a daily basis; 0.8% of web content will change completely on a weekly basis, such as a news site. Host name resolution: a domain name, an IP address, and a host name are three different things. Used the example of the host name origin.bmw.com. Talks about one method of attempting to spam that can be caught because the domains all resolve to the same host name. Lastly, shingle comparison: every document has a unique fingerprint. They break it down into a set of word patterns to determine if the content is duplicate. Recommends reading anything by Andrei Broder about shingles. With the sample site, each word set is similar across three pages with unique URLs that have the same word sets on each page. This is not dupe content spam, though. (Sorry, missed the reason for this.)

If you are sharing content across a network/multiple publications, use the robots exclusion protocol on the dupe pages away from the “main page.” PDFs are another type of duplicate content; use the robots.txt file to exclude one of the versions. Some dupe content is considered spam because the SEs only allow two pages per site per SERP, so additional content will end up in the supplemental results. If you know your network is going to deliver dupe content, don't let the SEs decide what will be presented in the SERPs – instead, use 301s and robots.txt.
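The robots exclusion advice above amounts to a couple of lines of robots.txt. A minimal sketch, assuming the print-friendly and PDF versions live under their own directories (the paths are made-up examples):

```
User-agent: *
Disallow: /print/
Disallow: /pdf/
```

Directory-prefix rules like these were universally supported at the time; wildcard patterns (e.g. matching every `.pdf` by extension) were a crawler-specific extension, so keeping duplicate formats under excludable paths is the safer layout.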

Jake Baillie – TrueLocal. “Dr. Phil on Duplicate Content.” Why does it happen? Top six dupe content mistakes: circular navigation, print-friendly pages, inconsistent linking, product-only pages, transparent serving, and bad cloaking.

Circular navigation: causes multiple paths through a website. Fix: define a consistent method of addressing a page of content, i.e. brand to category to content, or brand to content to category, etc., irrespective of navigation path. If you are breadcrumbing, track paths through cookies.

Print-friendly pages: all print-friendly pages are differently designed with the same content. Fix: block SEs from print-friendly pages.

Links not working for you anymore: calling directory index pages by different paths, such as /directory, /directory/, and /directory/index.asp. Fix: make sure you reference pages consistently. To avoid problems with external links, pick a canonical form and 301 redirect all others to the chosen version. Takes six months to “get back” from this.
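The three path variants above can be collapsed programmatically before generating links or redirects. A hypothetical sketch (the helper name and the list of index filenames are assumptions, not from the session):

```python
# Hypothetical path canonicalization: map /directory, /directory/, and
# /directory/index.asp to the single form /directory/ so every internal
# link, and every 301 for external links, points at the same address.
def canonical_path(path):
    for suffix in ("/index.asp", "/index.html", "/index.php"):
        if path.endswith(suffix):
            path = path[:-len(suffix)]
    if not path.endswith("/"):
        path += "/"
    return path
```

Running every generated link through one function like this is what “reference pages consistently” means in practice: the crawler only ever sees one URL per directory.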

Product pages with nothing differentiating them from other pages: bad, bad, bad…add new content.

Not good to be transparent: badly implemented rewrite code, DNS errors with multiple domains, poorly implemented cloaking/session-ID removal code. Fixes: domains should be redirected to the main site, not DNS aliased. Pick a canonical form to access content and stay with it. Has seen many “incomplete” mod_rewrites that allow the old page to continue to be referenced.

If the suit doesn't fit, don't wear it. Poorly implemented cloaking scripts serve the same doorway page over and over again. Fixes: don't use cloaking scripts you didn't write. Make sure your cloaking script is returning separate content for each URL being cloaked. (Lots of laughs during this part between him and Matt Cutts.) The same content should never be accessible from different URLs…ever!

Rajat Mukherjee – Yahoo. Informal remarks. Glad to be here. A few comments: in general, try not to make the same content available through multiple URLs. He says SEs are not vindictive folks, though Matt does snoop around and take pictures every once in a while (laughs). Rather than looking for ways to demote content, we are trying to find the right content to promote. Whenever possible, try to avoid duplication. You may want to create a new version of a site…be extra certain that robots don't crawl both versions. Remember that, independent of the size of the index, there will always be capacity constraints.

Matt Cutts – Google. Not prepared, but informal remarks. High-order nits: what do people worry about? He often finds that honest webmasters worry about dupe content when they don't need to. G tries to always return the “best” version of a page. Some people are less conscious: one person claimed he was having problems with dupe content and not appearing in both G and Y. Turns out he had 2,500 domains. A lot of people ask about articles split into parts, and then printable versions. Do not worry about G penalizing for this. Different top-level domains: if you own a .com and a .fr, for example, don't worry about dupe content in this case. General rule of thumb: think of SEs as a sort of hyperactive four-year-old kid that is smart in some ways and not in others: use the KISS rule and keep it simple. Pick a preferred host and stick with it…such as domain.com or www.domain.com.

Make sure you are consistent in your linking, because inconsistency will cause problems for robots. Use absolute links, since they don't usually get rewritten by scrapers. Speaking of which…make sure you have a copyright notice at the bottom of each page; he thinks you should do this as a blogger, too. They have been trying to produce better ways to figure these kinds of things out, and some of this “picking the right host” framework is in the new Big Daddy data centers. Also recommends using the Sitemaps tool to help diagnose and debug content. Sitemaps has a tool where you can take robots.txt “out for a test drive”: how would the Googlebot really respond to this? It will tell you specific things that will be disallowed.
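You can do a rough local version of that robots.txt “test drive” with Python's standard `urllib.robotparser`. The rules and URLs below are made-up examples, and this is not the Sitemaps tool itself, just the same idea: check which URLs a given robots.txt would disallow before deploying it.

```python
# Sketch: dry-run a robots.txt file locally before deploying it.
# The rules and example.com URLs are placeholders.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /print/",
    "Disallow: /pdf/",
]

rp = RobotFileParser()
rp.parse(rules)  # parse() accepts the file as a list of lines

print(rp.can_fetch("*", "http://www.example.com/print/article1.html"))  # disallowed
print(rp.can_fetch("*", "http://www.example.com/article1.html"))        # allowed
```

A quick loop over your site's known duplicate URLs tells you whether the exclusion rules actually cover them, which is exactly the question the Sitemaps tester answers against the real Googlebot logic.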

Q&A

First Danny…going back to feeding content: how can you ensure your page will be treated as the original page, and thus the displayed one? Rajat: we are trying hard to determine what the original page is, using shingling and other techniques to determine if the content has been altered. Matt: has heard more people are concerned about this. Asks how many have had content stolen: lots of hands. Three methods of copying someone else's content: 1. Stealing from a search engine (copying directly from results). 2. Outright webpage copying; usually the lifetime of that is relatively long. 3. RSS scraping…this is more difficult to catch, since content can be copied so much more quickly than by scraping a webpage. If it is always you that is getting ripped off, he says, that is actually a point in your favor. They can try to see who wrote stuff historically…how much you have been copied from, and how much of other people's stuff you copy.

Someone asks about having a hundred directory-type sites that use the same instructions for adding content: will this trigger duplicate content? Make sure that there is “real content” on each site. He would recommend using one domain to host the directions: say “we are part of this network, so go here for instructions.” Matt adds that diversity is very useful.

Using a hidden DIV…what is the policy on hidden links and JavaScript? Matt: in general, hidden links are a bad thing. The content should be of use to a visitor, and thus the link should be visible. Re: JavaScript, it can also be misused to try and cheat, so be careful. SEs are getting smarter about JS; a lot of the time simple heuristics can do the work. Rajat adds: make sure that the intent is clear, and finishes with “so cloaking is bad.” (Lots of laughs.) Jake adds that if you have an Ajax application where each user gets different content, serve a cloaked page to the SEs and the Ajax to the users: hide the Ajax interface from the SEs, and keep the content on the page (styling it out if needed). Matt says “NO…we will care, and it can get you banned if you are cloaking.” He recommends that if you have a weird site menu and “all sorts of Ajax,” use the sitemap to serve the content!

Didn't really get the whole question, but Matt answers, “there is nothing wrong with creating a template, but if you aren't adding useful content it's going to end up in the ghetto/bad neighborhood with lots of other ‘useless' sites.” Rajat makes what he says is a philosophical comment: SEs are still in their infancy, and while certain limitations re: Ajax etc. may exist today, the SEs will be improving here.

If I have five paragraphs on a page, and two are available on other sites, is this dupe? Rule of thumb: ask someone who has no association with you to look at the two pages and say what they feel. Kind of like the “grandma test.” Someone asks, “would you have your grandma look at your herbal Viagra site?” (Laughs…this is from a comment made earlier about herbal Viagra.) If lots of content is copied, then it looks more like a less valuable site.

As great as this session is…catch the next conference and you’ll get the rest of the Q&A.

This is part of the Search Engine Roundtable Blog coverage of the New York Search Engine Strategies Conference and Expo 2006. For other SES topics covered, please visit the Roundtable SES NYC 2006 category archives.


 

Comments:

N

02/28/2006 09:21 pm

I have a site with ~3000 pages that was fully indexed pre-BigDaddy. It's still getting crawled completely, but now only has ~500 pages in Google. Is this a type of duplicate filter, or a technical problem with the BigDaddy update?

Amit Verma

03/01/2006 09:33 am

Definitely, duplicate content will be a problem for a website when there is too much of it. Google has already announced that duplicate content will not be tolerated - http://www.google.com/dmca.html. Copyright issues come into play; your website could even be banned from Google. Beware of duplicate content on your website. Regards, Amit Verma - SEO Specialist India

Costin

10/19/2006 01:06 am

I will make a website that copies the title and description tags from pages with articles that interest my users. The page will be like a web directory, for example "work from home". If I copy the titles and descriptions exactly, and below them are original content discussions about those pages, will I be penalised? For example: "dear users, below are 10 websites that offer the solution for your 'how do I make a web page' problem... what we think about them is this: 1 is good, 2 is etc etc..." and so on. Will I get penalised for copying their titles and descriptions? (Links will be made with nofollow.) Thank you a million, costin

Mikhail Tuknov

12/04/2006 05:39 pm

On December 3rd, 2006, I was searching Google for “infatex,” which is the name of my website. My site comes up number 1 as usual. Then I looked a couple of listings down and found a .net version of my site. It had exactly the same title and description as my site. When I clicked on it, I was stunned: the .net version had exactly the same content, look, and design as infatex.com. I checked the Whois information for infatex.net, and it was registered on 2006-10-23. I am afraid that this person is trying to ruin my reputation by putting up exactly the same site; he is trying to show Google that I have a mirrored site. What should I do?

David McAllister

12/15/2006 05:19 pm

The Infatex.com question is really interesting. If somebody duplicates your content, could they get you knocked out of the search engines?

Larry Lim

06/14/2007 06:52 am

I can verify that printer-friendly pages are bad because I have a 2 year-old site that drops down the SERPs (from page 1 to page 6) from time to time because of it.

bilal

09/12/2007 05:47 pm

What about a print version of the page? Is this a better solution for reducing such effects on the page?

Pufkin

12/14/2007 10:09 pm

I want to put text and printable PDFs on my pages. Would it be safe to do that, or would Google consider it duplicate content? Thanks

Grigori Mikayelyan

02/04/2008 02:31 am

As a comment to Mikhail Tuknov, I would add the following. In my opinion, duplicating content should be considered theft and should be punished by law. My suggestion is to contact the webmaster of that site and warn him of legal action against him. That might scare him.

Grigori Mikayelyan

02/04/2008 02:33 am

And besides, you should contact Google and file a complaint with them.

Jack S.

02/04/2008 02:34 am

That is right Grigori !

bbj

02/10/2008 11:25 am

<li class="<?php /* Only use the authcomment class from style.css if the user_id is 1 (admin) */ if (1 == $comment->user_id) $oddcomment = "authcomment"; echo $oddcomment; ?>" id="comment…

Kevin

05/06/2008 04:17 am

My printer-friendly pages drop my site down the SERPs (from page 1 to page 6) from time to time.

No Name

08/01/2008 05:47 pm

If printer friendly pages are important for your visitors, then create them. Build your website for the humans not robots.

No Name

08/10/2008 09:24 pm

The worst enemy is duplicate content. Thanks for the info.

Clarence

09/14/2008 03:15 am

Hi, nice info; however, I believe that there is no such thing as a "Duplicate Content Penalty". http://www.clarencewang.com/blog/internet-marketing/there-is-no-such-thing-called-duplicate-content-penalty/

No Name

04/28/2009 06:54 pm

Excellent information. Thanks for clearing the confusion about duplicate content. I just tweaked my blog a little further and I believe I have done the right thing. If you are using WordPress, it is better to use excerpts on all pages except the single post page. Using this code can be very helpful:

<div class="entrybody">
<?php if (is_singular()) : ?>
<?php the_content(); ?>
<?php else : ?>
<?php the_excerpt(); _e('<p><a href="'.get_permalink().'"> Continue reading about '); the_title(); _e('</a></p>'); ?>
<?php the_tags( '<p>Tags: ', ', ', '</p>' ); ?>
<?php endif; ?>
</div>

This code is for self-hosted WordPress blogs, but if you are using another platform, it can still clear up some confusion related to content duplication. Dev.

No Name

07/13/2009 01:36 pm

Should I stop publishing my articles on article directories? I used to publish my articles there, but now I wonder whether I should stop doing this because of the risk of a duplicate content penalty.

assignmenthole

04/16/2011 04:54 am

We just have to live with the malaise of duplicate content indexing - it comes with the tutoring territory. Our students post questions over and over again - every semester a new batch does it. We cannot remove it, because then we would miss out on some details. I hope Google grows a third brain cell (actually I'm sure they have one) and cuts some of us some slack here.

SEO Companies

07/11/2011 01:47 am

This is one of the problems Google is trying to fix! Although, I personally think that it has always been part of the business!
