Duplicate Content Summit
Moderated by Danny Sullivan
Speakers: Vanessa Fox, Google; Amit Kumar, Yahoo; Peter Linsley, Ask.com; and Eytan Seidman, Microsoft Live
Eytan Seidman of Microsoft is up first.
Why duplicate content matters - on the search engine side and what you should be thinking about as a webmaster. I'll go over the two types of duplicate content - duplicate content that is created accidentally, and duplicate content that is taken from you. I will give principles and then explain how we (Microsoft) handle it.
When you think of duplicate content, it basically fragments your page in some way. You're fragmenting your rank. The page you might want to appear might have different versions of it now. Let's look at a scenario: http://mycompany.com/seattle/posters.htm. Someone then tells you it's better to have the keywords in a subdomain, so you create http://seattle-posters.mycompany.com. Without the proper redirects, you might run into duplicate content issues. One thing you should think about is session parameters - keep them very simple. Try to avoid feeding the engine a page that has a ton of traffic parameters in it. Duplicate content for locations is also something to think about - if you have unique content, it's fine. But if you have duplicate content for the UK and for the US, it's something to think about. Redirects: as much as possible, always use client-side redirects. Tell the client to redirect rather than do it server-side. JCPenny.com vs. JCPenney.com - it's a duplicate site. Another one is Wikipedia. Sometimes they have similar terms and a server-side injection. This is what hurts your ability to concentrate your rank on one page. Another thing to consider is http vs. https. If parts of your site need to be secure, you shouldn't duplicate it all on https.
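The redirect in the poster scenario can be sketched as a small WSGI app. This is only an illustration: the direction of the redirect (old path to new subdomain), the MOVED table and the redirect_app name are assumptions, not anything Eytan showed. It uses a 301, which he later says he counts as a "client-side" redirect.

```python
# Hypothetical mapping of old (host, path) pairs to the single URL you want indexed.
# The direction (old path -> new subdomain) is an assumption for illustration.
MOVED = {
    ("mycompany.com", "/seattle/posters.htm"): "http://seattle-posters.mycompany.com/",
}


def redirect_app(environ, start_response):
    """WSGI app: answer moved URLs with a permanent (301) redirect."""
    host = environ.get("HTTP_HOST", "").lower()
    path = environ.get("PATH_INFO", "/")
    target = MOVED.get((host, path))
    if target is not None:
        # The client (browser or crawler) is told where the content now lives.
        start_response("301 Moved Permanently",
                       [("Location", target), ("Content-Length", "0")])
        return [b""]
    body = b"Seattle posters"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

To try it locally, wsgiref.simple_server.make_server("", 8000, redirect_app).serve_forever() from the standard library is enough.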
How do you avoid having people copy your content? All my experience is based on sites I helped administer. One thing is a simple method - tell people that if they use your content, they should attribute it to you. You can also block out types of crawlers, detect user agents, block unknown IP addresses from crawling. There's a blog post on bot verification: http://blogs.msdn.com/livesearch/archive/2006/11/29/search-robots-in-disguise.aspx. It's important to differentiate on whether people are maliciously duplicating your content or if they're trying to help you.
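One common way to check that a visitor claiming to be a search robot really is one is a reverse DNS lookup followed by a forward lookup. The sketch below assumes that approach; whether it matches the linked post step by step is an assumption. The hostname suffix is a placeholder, not a real crawler domain, so check each engine's documentation for the real ones.

```python
import socket

# Placeholder suffix for illustration only; replace with the domains the
# search engines actually document for their crawlers.
TRUSTED_SUFFIXES = (".crawler.example.com",)


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None,
                        suffixes=TRUSTED_SUFFIXES):
    """Return True if ip reverse-resolves to a trusted hostname that
    forward-resolves back to the same ip (IPv4 only in this sketch)."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda name: socket.gethostbyname_ex(name)[2])
    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
        if not hostname.endswith(tuple(suffixes)):
            return False
        # The forward lookup guards against someone faking the reverse record.
        return ip in forward_lookup(hostname)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as unverified.
        return False
```

The two lookup parameters exist so the logic can be exercised without touching the network; in production you would leave them at their defaults.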
If you think you might duplicate content, consider the following:
- Is there any value in duplicating it? Or are you adding new value?
- If you're going to take someone's content, make sure you give attribution.
- If you have "local" pages, block them using robots.txt.
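The robots.txt suggestion can be sketched as follows. The /local/ and /print/ paths are made up for illustration; the check uses Python's standard urllib.robotparser to confirm that the rules do what you intend before you publish them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers out of duplicate "local" and printable pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /local/
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The duplicate local page is blocked, the main page stays crawlable.
blocked = parser.can_fetch("*", "http://mycompany.com/local/seattle.htm")
allowed = parser.can_fetch("*", "http://mycompany.com/seattle/posters.htm")
print(blocked, allowed)
```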
How does Live search handle duplicate content: We don't have site-wide penalties. We look aggressively for session parameters and tracking parameters at crawl time. We don't want false positives. One way to help us is to redirect those to be hidden from our crawlers. We filter duplicates at run time. We want to give users content that is unique.
The next person who speaks is Peter Linsley from Ask.com.
What's the standard definition of duplicate content? You have the same content on multiple URLs. It's rarely a good idea.
Why is this an issue for search engines? It impairs user experience and it consumes resources. Why is this an issue for webmasters? Fragmentation of pages - you really want to put everything in one place. You should care because search engines might index the wrong page and you don't want to leave it to us. Some cases are beyond your control (scrapers). These concerns are valid but are rare.
How does Ask.com handle duplicate content? It's not a penalty. It's basically the same as not being crawled. It's performed on indexable content - so templates and fluff (footer/header) are not considered. We only filter when the confidence factor is high. There's a low tolerance on false positives. A duplicate content candidate is identified from numerous signals - similar to ranking, the most popular is identified.
What can you do?
1- Act on the areas you're in control of. Consolidate your content under a single URL or implement 301 redirects. Put up a copyright or Creative Commons notice. Uniquify content.
2- Make it hard for scrapers. Mark your territory. Try to make your content so that it cannot be used in a generic context. Take legal action.
3- Contact us. You can send a re-inclusion request if you suspect you're being filtered out.
Potential questions for the Q&A:
Technical side: webmaster outreach, W3C URL standardization, watermarking/authentication, search engines improving at anti-spam.
Legal: what else can be done?
Economic: make it harder to monetize.
The third person who is up is Amit Kumar of Yahoo:
I'm going to concentrate on Yahoo-specific things and explain what we consider okay and what we hope people will do less often.
Where does Yahoo search eliminate dupes? We try to extract links during the crawl. We're less likely to extract links from pages that we know are duplicates and we're less likely to crawl new documents from duplicate sites. Primarily, we try to keep as many duplicate documents as possible in our index and use query time duplicate elimination: limits of 2 URLs/host and domain restrictions.
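The "2 URLs/host" limit at query time can be pictured with a short sketch. This is not Yahoo's implementation, only an illustration of the idea; the limit_per_host name and the sample URLs are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def limit_per_host(ranked_urls, max_per_host=2):
    """Keep results in rank order, but at most max_per_host from any one host."""
    seen = Counter()
    kept = []
    for url in ranked_urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


results = [
    "http://a.example.com/1",
    "http://a.example.com/2",
    "http://a.example.com/3",
    "http://b.example.com/1",
]
print(limit_per_host(results))
```

The third a.example.com result is dropped from the results page, but nothing is removed from the index itself.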
There are legitimate reasons to duplicate:
1. Alternate document formats - PDF, printer-friendly pages
2. Legitimate syndication (newspaper sites have wire-service stories)
3. Different languages
4. Partial duplicate pages: navigation, common site elements, disclaimers.
Accidental duplication - Session IDs in URLs: remember, to search engines, a URL is a URL. We can only crawl so much. Two URLs that refer to the same document look like duplicates. We can sort this out but it may inhibit crawling. Embedding session IDs in non-dynamic URLs doesn't change the fundamental problem.
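One way to keep session IDs from multiplying URLs is to strip them before you emit links. The sketch below is a minimal illustration; the parameter names in SESSION_PARAMS are assumptions, so use whatever your own site actually appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names for illustration; adjust to your own site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def crawler_friendly_url(url, drop=SESSION_PARAMS):
    """Return url without session parameters, so that the same document
    always ends up with the same URL."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in drop]
    # Host names are case-insensitive, so lowercase them; the fragment is
    # dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))


print(crawler_friendly_url(
    "http://MyCompany.com/seattle/posters.htm?sid=abc123&color=red"))
```

As the talk points out, moving the session ID into the path instead of the query string would not help: it is still a different URL for the same document.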
The other accidental case is soft 404s - "not found" error pages should return a 404 status code instead of a 200 status code. If not, we can end up crawling many copies of the "not found" page.
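A minimal sketch of the difference, as a WSGI app: the friendly "not found" page is still served, but with a 404 status rather than 200. The PAGES table and the not_found_app name are hypothetical.

```python
# Hypothetical site content, keyed by path.
PAGES = {
    "/": b"Home",
    "/seattle/posters.htm": b"Seattle posters",
}


def not_found_app(environ, start_response):
    """WSGI app: unknown paths get a real 404 status, not a 200 'soft 404'."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is None:
        status = "404 Not Found"
        body = b"Sorry, that page was not found."
    else:
        status = "200 OK"
    start_response(status, [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
    return [body]
```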
Dodgy duplication:
- Replicating content across multiple domains
- Aggregation
- Identical content with minimal value
Abuse: scraper spammers, weaving/stitching, bulk cross-domain duplication, bulk duplication with small changes. All of these are outside our content guidelines and can lead to unanticipated results for publishers.
How you can help us:
* Avoid bulk duplication of underlying documents. Do search engines need all versions of pages with small variations? Use robots.txt.
* Avoid accidental proliferation of many URLs for the same documents - session IDs, soft 404s, etc. Consider sessionID-free or cookie-free paths for crawlers. These are not abusive under our guidelines but may impair effective crawling.
* Avoid duplication of sites across many domains.
* When importing content from somewhere else:
  - Attribution.
  - Ask: do you have the rights to it? Are you adding value in addition or just duplicating?
In the last few months, we came out with robots-nocontent. It's a microformat-like tag for marking up low-value content (like disclaimers, etc.). It's useful to indicate where the core content is. We also came out with the ability to remove URLs that you don't want to be crawled.
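Yahoo's markup is a class attribute, class="robots-nocontent", placed on the element that wraps the low-value content. How Yahoo's crawler treats it internally was not described, so the sketch below is only a guess at the effect: it pulls the text of a page while skipping anything inside a marked element. It assumes well-formed HTML, and the CoreTextExtractor and core_text names are made up.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class CoreTextExtractor(HTMLParser):
    """Collect page text, ignoring elements marked class="robots-nocontent"."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a marked element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def core_text(html):
    parser = CoreTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


PAGE = ('<div><p>Poster catalog for Seattle.</p>'
        '<div class="robots-nocontent"><p>Legal disclaimer.<br>All rights reserved.</p></div>'
        '<p>Order today.</p></div>')
print(core_text(PAGE))
```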
The last speaker is Vanessa Fox of Google.
There are a lot of different kinds of duplicate sites.
Similar content: I'm sure you've seen the Buffy the Vampire Slayer episode where they had 2 Xanders. You just need to combine them into one page.
But sometimes they are more similar but different. There are 2 Willows - and these were not the same. One was more evil than the other. They just needed to be distinguished a little more. That is kind of easy - you should know how to do that because you have all watched Buffy.
Other things: syndicated content, manufacturer's database, printable versions, multiple language and countries. You need to add value to distinguish your version of the database but you might want to add stuff to robots.txt for printable versions.
Blogs are more of an issue lately because of RSS, archive pages, category pages. There are those issues - scraping types of issues.
What else do you hate? Would you want to have the ability to rewrite a URL and count everything the same for crawlers? Can you verify authorship?
Question and Answer:
Q: I was very surprised when Eytan said that redirects should be client side. How does a bot cope with meta refresh rather than a 301? Can the 4 search engines agree on a variable/parameter to track URLs (to get around the sessionID problem)? Eytan: When I say client side, I'm including things like a 301 redirect. When I refer to server-side, I refer to injections. Amit: In the JCPenney example, there are 2 identical sites. Who is copying who? Having a redirect is important for this. Peter: For the most part, a meta refresh is the same thing as a 301. Amit: About the parameter, Google had a note in the guideline not to have an id in the URL. But that has been removed. It's hard for us to figure it out. We're all working on making this better for you. Vanessa: Maybe not all CMSes can handle that. Potentially, you can tell us what the parameter is and we can do a rewrite. Danny: How many of you want to express that through a robots.txt file? Someone says "as many [ways] as possible."
Q: On nofollows, if we use too many of those to get rid of duplicate content, would that be like a red flag? Vanessa: You really only have control over your link. People can still link to your page without a nofollow. You might want to robots.txt them out instead.
Q: I work with a lot of companies that use Wordpress as a CMS. There are multiple author designations, etc. There is a concern that these are duplicate content. They are duplicates, but I want them to be referenceable search results for at least 2-3 different tags. Can you reach out and work with blogging platforms so that we don't have to do it ourselves? Vanessa: We should do that more so it's easier for a site owner. Eytan: What's the easiest solution you're trying to accomplish? Followup: When someone blogs, how do you designate the primary page vs. a tag page? Vanessa: We can usually sort that stuff out. Amit: It goes back to - is there a specific thing that you're looking for: "here's what I expected, here's what I found?"
Danny: There are 3 kinds of duplicate content - scraping, syndication, and site-specific duplicate content (you have 2 versions of an English-language site, like UK vs. US). He asks for a show of hands and most people are concerned about duplicate content within their own site.
Q: I have a question about datestamp, timestamp, and the date that the page has been discovered. I've seen people's sites being indexed but they were discovered before mine. How do you figure out who is first? Eytan: Over time, outside of the news scope, it's not really a big part of it. Over time, we're looking for other signals to determine which is the canonical source for the content. There definitely will be an aspect of that. We do leverage for both scenarios. Peter: It's gameable so we don't act so much on it. Amit: You can use a sitemap to help us determine who created it first. Danny: When you try to figure out what is the best copy, it's usually the "first copy." But some people might consider it the most-linked-to copy. What is it? Eytan: Usually, it's what ranks best in run-time. We're not looking at time. We're looking at a huge number of other factors. If someone copied TechCrunch or Search Engine Land verbatim, they probably won't be able to rank higher. Danny: What if you said "this is the date of this document to prove that I got there first?" Peter: It's something to think about but it is gameable so we need to be careful. Danny: There ought to be a way for search engines to see that it came from a sitemap because the search engine was pinged from this site first.
Q: My question is regarding sitemaps. A few months ago, something was announced about unifying the standard and we haven't heard anything about it since. Is there continued work on this and who is involved? Danny: They unified. They walked hand in hand. It was beautiful. Vanessa: Yes. I believe everyone is supporting it. We meet regularly to discuss other things that we can do. I would watch for more stuff because we have more stuff planned as we evolve. Amit: We created this format to get a long list of URLs. You should expect a lot in the future. Eytan: We're experimenting with auto-discovery in robots.txt. You should see stuff within 6 months or so.
Q: Tracking URLs and parameters are my biggest concern. Vanessa: I think we need to link to the canonical version to prevent dilution.
Q: We have a site that has data in the forms of graphs instead of text. How do you recommend that a search engine knows that this is unique content? Vanessa: Can you have a textual description for graphs? Followup: We have done that. Peter: Taking that a step further, use that description as the title of the page. Vanessa: I'm assuming your graph is an image, so we're not going to assume that they are duplicate pages. Eytan: Images don't really get parsed right. We don't know what the graphs mean. Adding text that differentiates the page would help a lot.
Q: I have a site that does how-to videos. Should we be worried about duplicate content for those videos? Vanessa: You're thinking for video search results? Followup: Exactly. Amit: Does the syndication of the video point back to the original page? Followup: In some cases, yes, in some cases, no. We do upload to YouTube and it goes back to your site. Vanessa: You can block with Robots.txt. I don't know too much about video search. I can try to find out more.
Q: I have 2 suggestions/questions. First is for digital signatures to prevent scraping - can we put in a unique identifying code to prove that we're the source? Also, there are no good reporting tools that indicate what the engines consider as duplicate content. What do you suggest? Danny: I think that would be cool. How many people would like that? [Applause.] Going back to the watermark thing, it's really hard because it needs to be a standardized format. Eytan: We've seen these tried in other mediums like email with fairly moderate success. How do you get people to adopt this? Michael Gray shouts from the audience: If you don't offer it, nobody can adopt it. Vanessa: If the original person doesn't authenticate but a more savvy person does, they can claim your content. Amit: We'll certainly look at all these things. Danny: I do have to give them credit. They rolled out a great amount of things in the past few years. Some stuff works, some stuff doesn't, but they try. Duplicate content reporting would be cool.
Q: Danny, you raised a question of what was the biggest problem and some people said that their content was scraped. We have resellers and we give them our content. They've used our content. Hundreds of thousands of sites have our content. What do we do now short of rewriting our site? Danny: That's where a duplicate content reporting tool would be helpful. How do you prove that it's your content though? That's where it gets difficult. You can possibly do manual reviews. Followup: What do I do right now? Vanessa: In your situation, it may be harder for you to contact all these people. Your best option really is to rewrite your content and make your content better. Followup: We gave the content out beforehand. It was great before, but it's a real problem for us now. Amit: In perspective, a reseller who has your content - this affects you if a page is exactly identical to your page. It's very hard, given all the factors, to tell whether that's why your page isn't being indexed. There are other reasons why your page may not be indexed, like your reseller is more established than you are. Then you might have to think about how to build your brand rather than focus on duplicate issues. I'd be a little careful about attributing that to duplicate content.
Q: A little different duplicate content problem - sites like eBay that have good SEO teams - you get multiple listings from the same business. eBay's subdomains are blatant and take over all the results. Vanessa: From our perspective, we want to show a variety of results from a variety of sites. We'll look at this. Followup: For years, eBay has done this and not a single search engine has addressed it. Danny: Google hates eBay now and Yahoo likes them. [laughter] Amit: I know you're being facetious but it doesn't matter. Give us test-cases. Show us which hosts from eBay shouldn't appear. We're always looking for these test-cases. Please submit them. We're absolutely working on it. We don't want to give companies with better SEO expertise any higher priority. Eytan: It's a ranking challenge as well. Perhaps the best results are on eBay's properties. There are filters at runtime. We all want to show the user content that is substantially unique.
Q: I want to question the premise of this panel. Why should we only see one listing of a piece of content? Let's say I wanted to find a fact - Abraham Lincoln's birthday. If Wikipedia comes up, what if I don't want Wikipedia? What about a way to group together duplicates? Danny: Do it in AJAX. It's all nice and slick. Peter: When we talk about duplicates, we're talking about the same exact content. Vanessa: Your example - if you don't want a site like Wikipedia - you might find a site that has the same stuff as Wikipedia. Followup: In the retail area, is it important to have 5000 descriptions of red widget #5? Vanessa: That's something we can take a look at for the user experience side. We might want to do experiments to see what would be the best and have that option available.