Specifically with Google: how does Google handle duplicate content from around the Web? It happens all the time. For example, you or I write an article, and the article is then syndicated on sites like SEOToday.com, SearchEngineJournal.com, SearchEngineWatch.com, and the New York Times (yeah, right). Now your unique content lives not only on your site but on three, four, five, or more other sites. Search engines, especially Google, do not want to show the same content multiple times to a searcher. The same thing also happens frequently when you send data feeds of your e-commerce products to shopping search engines: you might find that Google shows the shopping search engine's page while your e-commerce site does not show at all for that keyword phrase.
Why does this happen? That is the exact question over at Cre8asite Forums.
You can't blame the engines for wanting only one "source" document in the results. And you can't blame the original author for wanting his or her document to be considered that single "source" document.
The issue arises when two or more documents share the same content; it is then up to the search engine to figure out which is the original source document. How do they do this? With Google we have some clues, as Bill Slawski points out in the thread. He cites a Google patent named Detecting duplicate and near-duplicate files, and thankfully he pulled a quote from it:
In response to the detected duplicate documents, the present invention may also function to eliminate duplicate documents (e.g., keeping the one with best PageRank, with best trust of host, that is the most recent) Alternatively, the present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed (i.e., if document A is a near-duplicate of document B, and document B is a near-duplicate of document C, then document A is considered a near-duplicate of document C). Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent, etc.) is returned.
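The transitive clustering the patent describes (A is a near-duplicate of B, B of C, so A is grouped with C) is exactly what a union-find structure gives you. Here is a minimal sketch in Python, assuming the near-duplicate *pairs* have already been detected upstream (e.g., by comparing document fingerprints, which the sketch omits); the data and function names are illustrative, not Google's actual system.

```python
# Sketch of transitive near-duplicate clustering via union-find.
# Input: a list of documents and the near-duplicate pairs found upstream.
# Output: a cluster identifier for each document, as in the patent.

def cluster_near_duplicates(docs, near_dup_pairs):
    parent = {d: d for d in docs}

    def find(d):
        # Walk up to the cluster's root, compressing the path as we go.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in near_dup_pairs:
        union(a, b)

    # Each document gets the identifier of the cluster it belongs to.
    return {d: find(d) for d in docs}

clusters = cluster_near_duplicates(
    ["A", "B", "C", "D"],
    [("A", "B"), ("B", "C")],  # A~B and B~C, so A, B, C end up together
)
# A, B, and C share one cluster identifier; D stays in its own cluster.
```

At query time, two candidate results carrying the same cluster identifier would be collapsed down to one.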
PageRank can be one of the main considerations in deciding which document is the original source because, in Google's mind, PageRank indicates which documents are more trusted.
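The selection step the patent hints at ("keeping the one with best PageRank... that is the most recent") amounts to ranking the members of a cluster by those signals. A hypothetical sketch, with made-up field names and scores purely for illustration:

```python
# Illustrative sketch: from a cluster of duplicates, return only the
# document deemed most likely to be relevant, preferring higher PageRank
# and breaking ties by recency. The data model here is an assumption.

def pick_canonical(cluster_docs):
    return max(cluster_docs, key=lambda d: (d["pagerank"], d["crawl_date"]))

cluster = [
    {"url": "example.com/article", "pagerank": 6.1, "crawl_date": "2005-03-01"},
    {"url": "syndicator.com/copy", "pagerank": 4.8, "crawl_date": "2005-03-05"},
]
print(pick_canonical(cluster)["url"])  # the higher-PageRank document wins
```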
Relevancy can also be the determining factor. If the New York Times publishes my article, the other text on that page, and the links and content surrounding the primary content, may be less relevant to the topic of the article than my own page is. In that case, Google may rank my document as the source.
As for "best trust of host," I have a feeling that has to do with your linkage data. Bill suggests it might be "Authorities and Hubs," and although Google is not known to use the concept of Authorities and Hubs the way Teoma/Ask Jeeves does, it seems like the most logical explanation to me. The hosts are interconnected nodes, and computing something like a PageRank over those hosts to determine whether you're the original source sounds plausible, and cool.
Then there is the "age of the document," as Bill explains. But to use that, Google would need to store temporal data of some kind; simply looking at the page's header information would not be sufficient, since those dates are supplied by the publisher.
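One way an engine could keep trustworthy temporal data is to record the first date its own crawler saw each URL, rather than trusting a publisher-set Last-Modified header. This storage scheme is purely an assumption for illustration:

```python
# Sketch: track the first date *the crawler* observed each URL.
# A publisher can set Last-Modified to anything, but the crawler's own
# first-seen record can't be backdated by a scraper.

first_seen = {}

def record_crawl(url, crawl_date):
    # Only the first observation matters for the "age of the document".
    first_seen.setdefault(url, crawl_date)

record_crawl("original.com/post", "2005-01-10")
record_crawl("scraper.com/copy", "2005-02-20")
record_crawl("original.com/post", "2005-02-25")  # re-crawl; date unchanged
```

Under this scheme the original article keeps its earlier date no matter how often either page is re-crawled, so the copy can be identified as the newer, duplicate document.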