Duplicate Content Issues

Feb 28, 2006 • 3:27 pm | comments (20) by twitter | Filed Under Search Engine Strategies 2006 New York

Moderated by Danny Sullivan – Search Engine Watch and organizer of the conference. Moved into large room…filling up nicely. Introduces topic and first speaker.

Anne Kennedy – Beyond Ink. “Double Trouble - How to Avoid Duplicate Content Penalties.” What it is, why it is a problem, how to spot it, and how to fix it. Will also focus on “inadvertent” duplicate content.

What is dupe content: Multiple URLs with the same content…identical homepages w/same content. Why is it a problem? Because “they” say so. Recommends looking at the webmaster guidelines at G, Open Directory, and Yahoo. The real reason this is a problem is that you wind up confusing the SE robots.

Mirror sites: 1 website, 2 domains. Shows example of rsdfoundation.org. Somebody in the academic CPU center decided that “since SEs like .edu domains,” they should put the content live on the University of Florida site. Confusing the bot: 2 URLs. Other sites link to multiple root domains, so inbound links point to different domains for the same site. Describes how a client’s real domain and “canonical” domain conflicted, causing the whole site to not be listed.

Confusing the bot: dynamic URLs. As robots find dynamic content, the site may be returning a different URL with the same content…this is also a problem. Use the “repeat the search with the omitted results included” feature to see this happening with some websites. Recommends using robots.txt exclusion and 301 redirects. 301 redirects: “your hero.” Server-side redirects to a single canonical domain. Test the page to make sure it works, and ensure you use 301s instead of 302s. Find code for this at beyondink.com/301redirect. You can also contact Google, using “reinclusion request” in the subject line, to get help.
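To make the “single canonical domain” idea concrete, here is a minimal Python sketch of the decision a server-side redirect makes. It is not the code from the speaker’s site; the host name `www.example.com` and the function name `canonical_redirect` are assumptions made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the one host name you want the engines to index.
CANONICAL_HOST = "www.example.com"


def canonical_redirect(url, canonical_host=CANONICAL_HOST):
    """Return (301, target_url) if url is on a non-canonical host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == canonical_host:
        return None  # already canonical, serve the page normally
    target = urlunsplit(
        (parts.scheme or "http", canonical_host, parts.path or "/", parts.query, "")
    )
    # 301 (permanent), not 302 (temporary), as the speaker stresses.
    return 301, target


print(canonical_redirect("http://example.com/page?id=3"))
print(canonical_redirect("http://www.example.com/page?id=3"))
```

The first call yields a permanent redirect to the `www` host with path and query preserved; the second returns `None` because the URL is already in canonical form.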

Shari Thurow – Grantastic Designs. Will speak about the ways some SEs filter out dupe content. These include, but are not limited to: content properties, linkage properties, content evolution, etc…see below. Content properties: the SE looks for unique content by removing “boilerplate” such as navigation areas, etc., and analyzing the “good stuff.”

Dupe content filters: linkage properties. Looking at inbound and outbound links to determine whether it is dupe content spam. The way they can determine it isn’t is by seeing that the linkage properties are different for each site.

Content evolution: in general, 65% of websites will not change info on a daily basis. .8% of web content will change completely on a weekly basis, such as a news site.

Host name resolution: domain name, IP address, and host name are 3 different things. Used the example of the host name origin.bmw.com. Talks about one method of attempting to spam that can be caught because the domains all resolve to the same host name.

Lastly, shingle comparison: every document has a unique fingerprint. They break the document down into a set of word patterns to determine if the content is duplicate. Recommends reading anything by Andrei Broder about shingles. With the sample site, 3 pages with unique URLs have the same word sets on each page. This is not dupe content spam, though. (sorry, missed the reason for this)
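For readers who have not seen shingles before, here is a small Python sketch of the textbook version: break each document into overlapping runs of w words and compare the two sets. How any particular engine actually does this is not something the session spelled out; the shingle size `w=4` and the function names are assumptions for illustration.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (overlapping word runs) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short documents become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard overlap of the two shingle sets: 1.0 identical, 0.0 disjoint."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


page_a = "a rose is a rose is a rose"
page_b = "a rose is a flower which is a rose"
print(len(shingles(page_a)))        # number of distinct 4-word shingles
print(resemblance(page_a, page_b))  # closer to 1.0 means more duplicated text
```

A filter built on this would flag page pairs whose resemblance is above some threshold; where that threshold sits is a tuning decision this sketch does not address.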

If you are sharing content across a network/multiple publications, the fix is to use the robots exclusion protocol on the pages that duplicate the “main page.” PDFs are another type of duplicate content. Use the robots.txt file to exclude one of them. Some dupe content is considered spam because the SEs only allow 2 pages per site per SERP. Thus additional content will end up in the supplemental results. If you know your network is going to deliver dupe content, don’t let the SEs decide what will be presented in the SERPs. Instead, use 301s and robots.txt.

Jake Baillie – True Local. “Dr. Phil on Duplicate Content.” Why does it happen? Top 6 dupe content mistakes: circular navigation. Print-friendly pages. Inconsistent linking. Product-only pages. Transparent serving. Bad cloaking.

Circular navigation: causes multiple paths through a website. Fix: define one consistent method of addressing a page of content, i.e., brand to category to content, or brand to content to category, etc. This is irrespective of navigation path. If you are breadcrumbing, track paths through cookies.

Print-friendly pages: all print-friendly pages are a different design with the same content. Fix: block SEs from print-friendly pages.
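One way to block print-friendly pages is a robots.txt rule, and Python’s standard `urllib.robotparser` can check what such a rule would do before it goes live. This is only a sketch: the `/print/` directory, the example URLs, and the helper name `is_crawlable` are assumptions, not anything shown in the session.

```python
from urllib.robotparser import RobotFileParser

# Assumption: print-friendly copies live under /print/ on this hypothetical site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(is_crawlable(ROBOTS_TXT, "http://www.example.com/print/article-1"))
print(is_crawlable(ROBOTS_TXT, "http://www.example.com/articles/article-1"))
```

The print copy comes back as blocked and the regular article as crawlable, so only one version of the content is left for the robots to find.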

Link not working for you any more: calling directory index pages by different paths such as /directory, /directory/, and /directory/index.asp. Fix: make sure you reference pages consistently. To avoid problems with external links, pick a canonical form and 301 redirect all others to the chosen version. Takes six months to “get back” from this.
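The “pick a canonical form” advice for directory index pages can be sketched in a few lines of Python. Everything here is an assumption for illustration: the trailing-slash form is chosen as canonical, the list of index file names is invented, and any path whose last segment has no dot is treated as a directory.

```python
import posixpath

# Assumption: file names the server treats as the directory index.
INDEX_NAMES = {"index.asp", "index.html", "index.htm"}


def canonical_path(path):
    """Collapse /directory, /directory/ and /directory/index.asp into /directory/."""
    head, tail = posixpath.split(path)
    if tail.lower() in INDEX_NAMES:
        path = head  # drop the explicit index file name
    if "." in posixpath.basename(path):
        return path  # looks like a regular file, leave it alone
    return path.rstrip("/") + "/"


def redirect_for(path):
    """Return (301, canonical) when path is a non-canonical variant, else None."""
    target = canonical_path(path)
    return None if target == path else (301, target)


for p in ("/directory", "/directory/", "/directory/index.asp", "/directory/page.asp"):
    print(p, "->", redirect_for(p))
```

Only the two non-canonical variants produce a redirect; the canonical form and ordinary pages pass through untouched.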

Product pages with nothing differentiating them from other pages: bad, bad, bad…add new content.

Not good to be transparent: badly implemented rewrite code, DNS errors with multiple domains. Poorly implemented cloaking/session ID removal code. Fixes: domains should be redirected to the main site, not DNS aliased. Pick a canonical form to access content and stay with it. Has seen many “incomplete” mod rewrites that allow for the continued reference of the old page.
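As an illustration of what session ID removal is meant to achieve, here is a hedged Python sketch that strips session parameters so every visitor (and robot) sees the same URL for the same content. The parameter names in `SESSION_PARAMS` and the function name are assumptions; a real site would use whatever names its own platform generates.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameter names that carry a session ID on this site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def strip_session_id(url):
    """Return url with any session ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_session_id("http://www.example.com/product?id=7&sid=abc123"))
print(strip_session_id("http://www.example.com/product?PHPSESSID=xyz"))
```

The cleaned URL is the one to link to and, in keeping with the advice above, the one any session-carrying variant should be redirected to.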

If the suit doesn’t fit, don’t wear it. Poorly implemented cloaking scripts serve the same doorway page over and over again. Fixes: Don’t use cloaking scripts you didn’t write. Make sure your cloaking script is returning separate content for each URL being cloaked. (Lots of laughs during this part between him and Matt Cutts) The same content should never be accessible from different URLs…ever!

Rajat Mukherjee – Yahoo. Informal remarks. Glad to be here. A few comments: in general, try not to make the same content available through multiple URLs. He says SEs are not vindictive folks, though Matt does snoop around and take pictures every once in a while (laughs). Rather than looking for ways to demote content, we are trying to find the right content to promote. Whenever possible, try to avoid it. You may want to create a new version of a site…be extra certain that robots don’t crawl both versions. Remember that independent of the size of the index, there will always be capacity constraints.

Matt Cutts – Google. Not prepared, but informal remarks. High order bits: what do people worry about? He often finds that honest webmasters worry about dupe content when they don’t need to. G tries to always return the “best” version of a page. Some people are less conscious. One person claimed he was having problems with dupe content and not appearing in both G and Y. Turns out he had 2500 domains. A lot of people ask about articles split into parts and then printable versions. Do not worry about G penalizing for this. Different top level domains: if you own a .com and a .fr, for example, don’t worry about dupe content in this case. General rule of thumb: think of SEs as a sort of hyperactive 4-year-old kid that is smart in some ways and not so in others: use the KISS rule and keep it simple. Pick a preferred host and stick with it…such as domain.com or www.domain.com.

Make sure you are consistent in your linking, because inconsistency will cause problems for robots. Use absolute links, since they don’t usually get rewritten by scrapers. Speaking of…make sure you have a copyright notice at the bottom of each page. Thinks you should use this as a blogger too. They have been trying to produce better ways to figure out these kinds of things, and some of this “picking the right host” framework is in the new Big Daddy data center. Also recommends using the Sitemaps tool to help diagnose and debug content. Sitemaps has a tool where you can take robots.txt “out for a test drive.” How would the Googlebot really respond to this? It will tell you the specific things that will be disallowed.


First Danny…going back to feeding content. How can you ensure your page will be treated as the original page and thus the displayed one? Rajat: we are trying hard to determine what the original page is, by using shingling and other techniques to determine if the content is altered. Matt: has heard more people are concerned about this. Asks how many have had content stolen: lots of hands. 3 methods of copying someone else’s content: 1. Stealing from the search engine (copying directly from results). 2. Outright webpage copying. Usually the lifetime of that is relatively long. 3. RSS scraping…this is more difficult to catch, since the content can be copied so much more quickly than by scraping a webpage. If it is always you that is getting ripped off, he says, that is actually a point in your favor. They can try to see who wrote stuff historically…how much you have been copied from, and how much of other people’s stuff you copy.

Someone asks about having a hundred directory-type sites that all use the same instructions for adding content: will this trigger duplicate content? Make sure that there is “real content” on each site. He would recommend using one domain to host the directions. Say “we are part of this network so go here for instructions.” Matt adds that diversity is very useful.

Using a hidden DIV…what is the policy on hidden links and JavaScript? Matt: in general hidden links are a bad thing. The content should be of use to a visitor, and thus the link should be visible too. Re: JavaScript, it can also be misused to try and cheat, so be careful. SEs are getting smarter about JS; a lot of times simple heuristics can do the work. Rajat adds: make sure that your intent is clear, and finishes with “so cloaking is bad.” (Lots of laughs) Jake adds that if you have an Ajax application that each gets different content, serve a cloaked page to the SEs and the Ajax to the users. Hide the Ajax interface from the SEs, and keep the content on the page (styling it out if needed). Matt says “NO…we will care, and it can get you banned if you are cloaking.” He recommends that if you have a weird site menu and “all sorts of Ajax,” you use the sitemap to serve the content!

Didn’t really get the whole question, but Matt answers “there is nothing wrong with creating a template, but if you aren’t adding useful content it’s going to end up in the ghetto/bad neighborhood with lots of other ‘useless’ sites.” Rajat makes what he says is a philosophical comment: SEs are still in their infancy, and while certain limitations re: Ajax etc. may exist today, the SEs will be improving here.

If I have five paragraphs on a page, and two are available on other sites, is this dupe? Rule of thumb: ask someone who has no association with you to look at the two pages and say what they feel. Kind of like the “grandma test.” Someone says “would you have your grandma look at your herbal Viagra site?” (laughs…this is from a comment made earlier about herbal Viagra) If lots of content is copied, then it looks more like a lower-value site.

As great as this session is…catch the next conference and you’ll get the rest of the Q&A.

This is part of the Search Engine Roundtable Blog coverage of the New York Search Engine Strategies Conference and Expo 2006. For other SES topics covered, please visit the Roundtable SES NYC 2006 category archives.

