Google: Content Stitching Or Quilting Is Not Near Duplicate Content

Jun 21, 2017 • 8:11 am | comments (9) by twitter Google+ | Filed Under Google Search Engine Optimization
 

Dawn Anderson followed up on a topic around what is near duplicate content with Google's Gary Illyes - asking if it is similar to content stitching and quilting. As Dawn suspected, Gary said no, it is not. Here it is on Twitter where Dawn asked "'Content stitching / quilting'... this is not the same as near-duplicate as defined in ur prev tweet?" and Gary responded that she is correct.

Here are the tweets:

Dawn then sent me some more technical information on this. She said that Marc Najork, who is now at Google, wrote a paper on this while at Microsoft named Detecting Quilted Web Pages at Scale. Here is the abstract:

Web-based advertising and electronic commerce, combined with the key role of search engines in driving visitors to ad-monetized and e-commerce web sites, has given rise to the phenomenon of web spam: web pages that are of little value to visitors, but that are created mainly to mislead search engines into driving traffic to target web sites. A large fraction of spam web pages is automatically generated, and some portion of these pages is generated by stitching together parts (sentences or paragraphs) of other web pages. This paper presents a scalable algorithm for detecting such “quilted” web pages. Previous work by the author and his collaborators introduced a sampling-based algorithm that was capable of detecting some, but by far not all quilted web pages in a collection. By contrast, the algorithm presented in this work identifies all quilted web pages, and it is scalable to very large corpora. We tested the algorithm on the half-billion page English-language subset of the ClueWeb09 collection, and evaluated its effectiveness in detecting web spam by manually inspecting small samples of the detected quilted pages. This manual inspection guided us in iteratively refining the algorithm to be more efficient in detecting real-world spam.

There is no doubt Google and other search engines are on to this type of behavior but it is always nice pointing to research papers when we can. Thanks Dawn.

Forum discussion at Twitter.

Previous story: Google Got An Interactive Fidget Spinner
 
blog comments powered by Disqus