Duplicate Content Issues (Yahoo & Google)

Nov 15, 2006 • 5:51 pm | Filed Under WebmasterWorld 2006 Las Vegas

Amanda Watlington was up first. There are many typical causes of duplicate content. Tools to detect it include plain searching (take a string of 10 words from your content and search on it), tools such as CyberSpyder or WebBug, and whois domain ownership lookups so you can contact the owners. Multiple domains are one cause; it happens when someone buys a domain, and there are other reasons people hold multiple domains... In 2004, the monkbiz.com site acquired monkeybusiness.com, and both domains are aliases of each other. Once you detect the issue, repair it with 301s, etc. Another cause of duplicate content is redesigning your site, for example changing your URL structure or switching from HTML to PHP. She then gave an example. If you see more pages in the engine than exist on the site, that is a sign of duplicate content issues. You can 301 or 404 those pages. Another cause is content management systems; this happens with e-commerce sites, and also with sites that run PPC campaigns. She showed examples of ways to reach the same product through two different URLs. Detect the issue by seeing how the URLs look for products listed in multiple sections; repair it by rearchitecting your URLs. She showed some examples... Landing pages can have many URLs with the same content. You can 301 some of them or use the robots.txt exclusion protocol to tell spiders not to crawl them. Another issue is content syndication and scraping, and she showed examples (lots of affiliates have this issue).
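Amanda's quickest detection trick, searching on a 10-word string, scales up naturally to comparing whole pages by their 10-word windows. A minimal sketch of that idea (not a tool she showed; the function names are mine), assuming the page text has already been fetched and stripped of HTML:

```python
# Sketch: flag near-duplicate pages by comparing their 10-word
# shingles, the same unit Amanda suggests searching on.

def shingles(text, w=10):
    """Return the set of w-word windows in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def similarity(a, b, w=10):
    """Jaccard similarity of two pages' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 means the two pages are effectively the same document; a score near 0.0 means they share almost no 10-word runs.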

Bill Slawski is up next, ready to talk about duplicate content. He explains the fundamental issue: the search engine wants one copy of the content, so which one do they choose? He talked about a site that had 3,500 pages but 95,000 pages indexed in Google. He noticed some weird patterns: widgets that expanded and collapsed menus, each with different URLs but the same content. They replaced the widgets with crawlable JavaScript functions, and the duplicate content dropped. His list of duplicate content issues:

#1: Reusing manufacturers' product descriptions is a common issue.
#2: Alternative print pages; there are ways around it.
#3: RSS feeds are syndicated quickly, so you need to become the authoritative source.
#4: Canonicalization issues.
#5: Session IDs.
#6: Multiple data variables.
#7: Pages with content that is just too similar (page titles, etc.).
#8: Copyright infringement can be an issue.
#9: The same pages on subdomains and different TLDs.
#10: Article syndication may be an issue.
#11: Mirrored sites.

There is a white paper named "Do Not Crawl in the DUST: Different URLs with Similar Text." It discusses ways to identify duplicate content and how engines may handle it; in the paper they use the word "DustBuster." The limitation of the DUST paper is that it doesn't detail which pages are kept and which are discarded. Collapsing Equivalent Results, an MSN patent application, uses a query-independent ranking component, a result analysis component, a navigational model selection mechanism, and more. He showed some result analysis factors: an extension like .com might be preferred over .net, shorter URLs may be better, fewer redirects is better, and so on. It does not mean Microsoft is doing this; it is just a patent application. Searcher and site location or language may be a factor. Obviously, popularity may have an impact. Click-throughs may be tracked as well.
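The DUST paper is about recognizing different URLs with similar text before crawling them; its actual rules are learned from site data, but a hand-rolled normalization pass illustrates the idea and also covers Bill's canonicalization and session-ID issues. The session-parameter names and rules below are illustrative assumptions, not the paper's output:

```python
# Sketch of DUST-style URL normalization: collapse URL variants that
# typically point at the same content (lowercase the host, drop the
# port and fragment, strip session-ID parameters and trailing slashes).
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}  # assumed names

def canonicalize(url):
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Keep only query parameters that are not session identifiers.
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k.lower() not in SESSION_PARAMS])
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))
```

Two URLs that canonicalize to the same string are candidates for a 301 from one to the other.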

Tim Converse from Yahoo! said he spends his days fighting off black hats. How do you spell a short version of duplicate? Dupe? Dup? Doop? Why do search engines care about duplication? User experience: they don't want to show the same content over and over again for the same query. Another issue is that if they crawl only duplicate content, they have nothing to differentiate from other content (but this is less of an issue). The most important thing is to show the content from the originating source. Where does Yahoo! get rid of dups? At every point in the pipeline, including crawl time, index time, and query time; they prefer to remove dups at query time. They try to limit results to two URLs per host in the SERPs. Why would Yahoo! ever want to keep a duplicate page? Historically they didn't want to, because of hard drive costs. But if you are looking for news on a specific site, you sometimes want to show a dup there. They also want to show regional preferences. Two docs may be similar but not exactly the same. And keeping dups provides redundancy, in case one site goes down. Legitimate reasons to duplicate include alternate document formats, legitimate syndication, multiple languages, and partial dup pages from boilerplate (navigation, disclaimers, etc.). Accidental duplication includes session IDs and soft 404s (error pages served without a 404 status code); these types are not abusive but can cause issues. Dodgy duplication includes replicating content over multiple domains unnecessarily, "aggregation" of content found elsewhere on the web, and identical content repeated with minimal added value. Others include scraper spammers, weaving/stitching (mix-and-match content), bulk cross-domain apps, and bulk dup with small changes.
How can you help Yahoo! with this issue? Avoid bulk duplication of your content over multiple domains, use robots.txt to block bots from duplicate areas, avoid accidental proliferation of dup content (session IDs, soft 404s, etc.), avoid duplicating sites across many domains, and when importing content from elsewhere, ask: do you own it, and are you adding value?
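Tim's soft-404 point is easy to check on your own site: request a path that cannot exist and look at the status code the server actually returns. A sketch of that heuristic (the probe path and labels are my own, not a Yahoo! tool):

```python
# Sketch: test whether a site returns real 404s. Fetch a path that
# almost certainly does not exist; if the server answers 200, every
# dead URL risks being indexed as a duplicate "not found" page.
import random
import string

def probe_path():
    """A random path that should not exist on any site."""
    token = "".join(random.choices(string.ascii_lowercase, k=24))
    return "/" + token

def classify(status_code):
    """Interpret the status returned for the nonexistent probe path."""
    if status_code == 404:
        return "hard 404 (good)"
    if status_code in (301, 302):
        return "redirects missing pages (check the target)"
    if status_code == 200:
        return "soft 404 (duplicate-content risk)"
    return "unexpected status %d" % status_code
```

To run it for real, fetch `probe_path()` on your own host with urllib and pass the response status to `classify`.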

Brian White from Google was up last. He was in the forums once, and he works with Matt Cutts in that group. He sat in on the other panelists' talks and said he would go quickly over the material. Types of duplication include multiple URLs going to the same page, similar content on different pages, syndicated content, manufacturers' databases, printable pages, different languages or countries, different domains, and scraped content. How do search engines handle this? They detect duplicate content throughout the pipeline, like Yahoo!. The goal is to serve one version of the content in the search results. What can you do to help? Use robots.txt, use 301s, block printable versions, minimize boilerplate, and if two pages are really similar, ask whether you need both. Anchors (#) do not cause problems. If you're importing a product database, create your own content. The same content in multiple languages is not duplication, and neither are geo TLDs. If you syndicate content from others, make sure to include an absolute link to the origin. As for scrapers, they are working on it; let Google know at www.google.com/dmca.html, and there is more info on bots spoofing Googlebot at the Google Webmaster Central blog at http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
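The Googlebot post linked above describes a reverse-then-forward DNS check: reverse-resolve the visiting IP, confirm the name is under a Google crawler domain, then forward-resolve that name and confirm it maps back to the same IP. A sketch of that check (function names are mine; it needs live DNS to run end to end):

```python
# Sketch of the reverse-then-forward DNS check for verifying that a
# visitor claiming to be Googlebot really is one.
import socket

def name_is_google(hostname):
    """True if the hostname sits under a Google crawler domain."""
    host = hostname.rstrip(".").lower()
    return host.endswith(".googlebot.com") or host.endswith(".google.com")

def is_real_googlebot(ip):
    try:
        hostname = socket.gethostbyaddr(ip)[0]       # reverse DNS
        if not name_is_google(hostname):
            return False
        forward = socket.gethostbyname(hostname)     # forward confirmation
        return forward == ip
    except OSError:
        return False
```

The suffix check alone is not enough, because anyone can publish reverse DNS claiming to be `googlebot.com`; the forward lookup is what closes the loop.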



Michael Visser

11/24/2006 01:28 am

I never thought this to be an issue until reading this post. After looking up a few of my own posts, both my archived article and the active URL are spidered by the search engines: same content, two points of access. I'm going to look into removing this issue. Thanks!


