Amanda Watlington was up first. She walked through the typical causes of duplicate content. To detect it, take a 10-word snippet of your content and search on it; tools like CyberSpyder or WebBug can help, and whois domain ownership info lets you contact the site hosting the copy. Multiple domains are one cause — people end up with several domains for all sorts of reasons. In 2004, the monkbiz.com site acquired monkeybusiness.com, and both domains became aliases of each other. Once you detect the issue, repair it with 301 redirects.

Another cause of dup content is a site redesign — for example, changing your URL structure or switching from HTML to PHP. She gave an example: if you see more pages in the engine than exist on your site, that is a sign of duplicate content issues. You can 301 or 404 those extra pages. Content management systems are another cause; this happens with e-commerce sites, and also with sites running PPC campaigns. She showed examples of reaching the same product through two different URLs. Detect the issue by checking how the URLs look for products listed in multiple sections, and repair it by rearchitecting your URLs. Landing pages can likewise have many URLs with the same content; you can 301 some of them, or use the robots.txt exclusion protocol to tell spiders not to crawl them. Finally, content syndication and scraping cause duplication — lots of affiliates have this issue — and she showed examples.
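As a concrete illustration of the 301 repair Amanda described for aliased domains, an Apache rewrite that permanently redirects the acquired domain to the canonical one might look like this (a minimal sketch, assuming Apache with mod_rewrite enabled; adapt the host names and server setup to your own site):

```apache
# .htaccess — send any request arriving under the alias domain
# to the same path on the canonical domain with a 301
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?monkeybusiness\.com$ [NC]
RewriteRule ^(.*)$ http://www.monkbiz.com/$1 [R=301,L]
```

The permanent (301) status is what tells the engines to consolidate the two URLs rather than index both.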
Tim Converse from Yahoo! said he spends his days fighting off black hats. How do you spell a short version of "duplicate" — dupe? dup? doop? Why do search engines care about duplication? User experience: they don't want to show the same content over and over again for the same query. Another issue is that if they spend their crawl on dup content, they have nothing to differentiate it from other content (but this is less of an issue). The most important thing is to show the content from the originating source. Where does Yahoo! get rid of dups? At every point in the pipeline — crawl time, index time, and query time — though they prefer to remove dups at query time. They try to limit results to two URLs per host in the SERPs.

Why would Yahoo! ever want to keep a duplicate page? Historically they didn't, because of hard drive costs. But if you are looking for news on a specific site, you sometimes want to see the dup there. They also want to honor regional preferences. Two docs may be similar but not exactly the same. And redundancy helps, in case one site goes down. Legitimate reasons to duplicate include alternate document formats, legitimate syndication, multiple languages, and partial dup pages from boilerplate (navigation, disclaimers, etc.). Accidental duplication includes session IDs and soft 404s (error pages that don't return a 404 status code); these types are not abusive but can cause issues. Dodgy duplication includes replicating content over multiple domains unnecessarily, "aggregation" of content found elsewhere on the web, and identical content repeated with minimal added value. Others include scraper spammers, weaving/stitching (mix-and-match content), bulk cross-domain duplication, and bulk dup with small changes.
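Tim didn't disclose how Yahoo!'s dedup pipeline actually works, but the classic textbook technique for catching pages that are "similar but not exactly the same" is word shingling with Jaccard similarity. A minimal sketch (the shingle size and threshold here are arbitrary, not Yahoo!'s values):

```python
# Near-duplicate detection via word shingles + Jaccard similarity.
# A textbook sketch, not Yahoo!'s actual pipeline.

def shingles(text, k=4):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(doc1, doc2, threshold=0.8):
    """True if the two documents share most of their shingles."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold
```

Real engines scale this up with techniques like minhashing so they never compare every pair of documents directly, but the similarity idea is the same.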
How can you help Yahoo! with this issue? Avoid bulk duplication of your content over multiple domains, use robots.txt to block bots from dup content, avoid accidental proliferation of dup content (session IDs, soft 404s, etc.), avoid duplicating sites across many domains, and when importing content from elsewhere, ask yourself: do you own it, and are you adding value?
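One way to act on the "avoid session IDs" advice is to canonicalize your URLs server-side by stripping session-style query parameters before linking or redirecting. A minimal Python sketch — the parameter names below are common examples I've chosen for illustration, not an official list:

```python
# Strip session-ID style query parameters so one page has one URL,
# avoiding the accidental duplication Yahoo! warns about.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative set of session parameter names; extend for your platform.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url):
    """Return the URL with session parameters removed and host lowercased."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path, urlencode(kept), ""))
```

For example, `canonicalize("http://Example.com/p?item=42&PHPSESSID=abc123")` collapses to `http://example.com/p?item=42`, so the spider only ever sees one URL per product.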
Brian White from Google was last up. He used to participate in the forums and works with Matt Cutts, in that group. He said he would go quickly over his material, since the other panelists had covered a lot of it. Types of dup content include multiple URLs going to the same page, similar content on different pages, syndicated content, manufacturers' databases, printable pages, different languages or countries, different domains, and scraped content. How do search engines handle this? Like Yahoo!, they detect dup content throughout the pipeline. The goal is to serve one version of the content in the search results. What can you do to help? Use robots.txt, use 301s, block printable versions, and minimize boilerplate; anchors (#) do not cause problems. If two pages are really similar, ask whether you need both. If you're importing a product database, create your own content. The same content in multiple languages is not a dup, and neither are geo TLDs. If you syndicate content from others, make sure to include an absolute link to the origin. Google is working on scrapers — report them at www.google.com/dmca.html — and there is more info on bots spoofing Googlebot at the Google Webmaster Central blog: http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
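The Webmaster Central post Brian pointed to describes verifying Googlebot with a reverse-then-forward DNS check: reverse-resolve the visiting IP, confirm the name is under googlebot.com (or google.com), then forward-resolve that name and confirm it maps back to the same IP. A minimal Python sketch of that check (error handling simplified; the network lookups only run when `verify_googlebot` is called):

```python
# Reverse-then-forward DNS verification of a crawler claiming to be
# Googlebot, per the Google Webmaster Central post. Sketch only.
import socket

def is_googlebot_hostname(hostname):
    """True if a reverse-DNS name is under Google's crawler domains."""
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def verify_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then confirm the
    forward lookup of that hostname returns the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not is_googlebot_hostname(hostname):
        return False
    try:
        return socket.gethostbyname(hostname) == ip
    except socket.gaierror:
        return False
```

The suffix check alone is not enough — a spoofer can name a host `googlebot.com.evil.example` — which is why the post insists on the forward lookup confirming the original IP.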