Duplicate Content - Resellers Ranking Higher

Feb 16, 2005 • 9:02 am | comments (2) | Filed Under Google Search Engine Optimization
 

Specifically with Google, how does Google handle duplicate content from around the Web? It happens all the time. For example, you or I write an article, and the article is then syndicated on sites like SEOToday.com, SearchEngineJournal.com, SearchEngineWatch.com, and the New York Times (yea right). So now your unique content appears not only on your site but on three, four, five, or more other sites. Search engines, especially Google, do not want to return the same content multiple times to their searchers. The same thing happens frequently when you send data feeds of the products on your e-commerce site to shopping search engines: you might find that Google shows the shopping search engine's page while your e-commerce site does not show at all for that keyword phrase.

Why does this happen? That is the exact question over at Cre8asite Forums.

You can't blame the engines for wanting only one "source" document in the results. And you can't blame the original author of the content for wanting his or her document to be considered the single "source" document.

The issue arises when you have two or more documents with the same content. Then it is up to the search engine to figure out which is the original source document. How do they do this? With Google we have some clues, as pointed out by Bill Slawski in the thread. He quotes a Google patent named Detecting duplicate and near-duplicate files. Thankfully, he pulled a quote from the patent:

In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent) Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent, etc.) is returned.
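The transitive clustering the patent describes can be sketched with a union-find structure: if A is detected as a near-duplicate of B, and B of C, all three land in one cluster. This is a minimal illustrative sketch only; the similarity test here (Jaccard overlap of word sets, with a made-up threshold) is a stand-in, since the patent excerpt does not specify how near-duplicates are detected.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two documents."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def cluster_near_duplicates(docs, threshold=0.8):
    """Group documents into clusters under the transitive assumption,
    using union-find: pairwise near-duplicates get merged into one cluster."""
    parent = list(range(len(docs)))

    def find(i):
        # Follow parent pointers to the cluster root, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(docs[i], docs[j]) >= threshold:
                union(i, j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

docs = [
    "google may keep only one copy of a syndicated article",
    "google may keep only one copy of a syndicated article today",
    "an entirely different page about shopping feeds",
]
print(cluster_near_duplicates(docs))  # prints [[0, 1], [2]]
```

At query time, per the patent, only the cluster member deemed most relevant (higher PageRank, more recent, etc.) would be returned when two candidates from the same cluster match equally well.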

PageRank can be one of the main considerations as to which document is the original source, because, in Google's mind, PageRank tells them which documents are more trusted.

Relevancy can also be the determining factor. If the New York Times publishes my article, maybe the other text on that page will be less relevant to the topic of the article, and maybe the links and content surrounding the primary content will be less relevant too. And maybe, as a result, Google will rank my document as the source.

"Best Trust of Host", I have a feeling that has to do with your linkage data. Bill suggests that it might be "Authorities and Hubs", and although Google is not known to utilize the concept of Authorities and Hubs as would Teoma/Ask Jeeves it seems like the most logical explanation to me. The nodes are interconnected and determining the (maybe) PageRank of those hosts to determine if your the original source, well it sounds cool.

Then there is the "age of the document," as Bill explains. But for that, they would need to store temporal data of some kind; simply looking at the page header information would not be sufficient.

 

Comments:

Bill

02/17/2005 12:41 am

Hey! Thanks for the mention, Barry. I'm trying to dig a little deeper into how search engines do handle duplicate content. The different patents from Google, Overture, Altavista, and so on seem to be a good start. That "best trust of host" statement has me wondering, too. The concept of fingerprinting documents is interesting: finding a number of places to compare, just like is done on fingerprints, and seeing if there are matches at those points is a good way to streamline the process of checking whether documents match without having to run processor-intensive comparisons. Trying to determine if there is a match by, for instance, seeing how many of the links in and out of a page match with links in and out of other pages is an approach that also seems like it would work, to some degree.
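[Editor's note: a rough sketch of the fingerprinting idea Bill describes, hashing overlapping word shingles and keeping only the few smallest hashes as the document's "fingerprint points." This is a toy illustration, not taken from any of the patents discussed; the shingle size and point counts are arbitrary.]

```python
import hashlib

def fingerprint(text, shingle_size=3, k=4):
    """Hash overlapping word shingles and keep the k smallest hashes
    as the document's fingerprint points."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + shingle_size])
                for i in range(len(words) - shingle_size + 1)]
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles)
    return set(hashes[:k])

def likely_duplicates(a, b, min_matches=3):
    # Comparing a handful of fingerprint points is far cheaper than a
    # full text comparison; enough matching points flags a probable match.
    return len(fingerprint(a) & fingerprint(b)) >= min_matches
```

Identical or lightly edited texts share most fingerprint points, while unrelated texts share none, so most document pairs can be dismissed without ever comparing their full text.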

Barry Schwartz

02/17/2005 04:29 pm

Sounds interesting, if you find anything, please let us know. :) I'll be watching the thread.
