Google: Content Stitching Or Quilting Is Not Near Duplicate Content

Jun 21, 2017 - 8:11 am


Dawn Anderson followed up with Google's Gary Illyes on what counts as near-duplicate content, asking whether it is the same thing as content stitching or quilting. As Dawn suspected, Gary said it is not. On Twitter, Dawn asked "'Content stitching / quilting'... this is not the same as near-duplicate as defined in ur prev tweet?" and Gary confirmed that she is correct.


Dawn then sent me some more technical background on this. She said that Marc Najork, who is now at Google, wrote a paper on the topic while at Microsoft, titled "Detecting Quilted Web Pages at Scale." Here is the abstract:

Web-based advertising and electronic commerce, combined with the key role of search engines in driving visitors to ad-monetized and e-commerce web sites, has given rise to the phenomenon of web spam: web pages that are of little value to visitors, but that are created mainly to mislead search engines into driving traffic to target web sites. A large fraction of spam web pages is automatically generated, and some portion of these pages is generated by stitching together parts (sentences or paragraphs) of other web pages. This paper presents a scalable algorithm for detecting such “quilted” web pages. Previous work by the author and his collaborators introduced a sampling-based algorithm that was capable of detecting some, but by far not all quilted web pages in a collection. By contrast, the algorithm presented in this work identifies all quilted web pages, and it is scalable to very large corpora. We tested the algorithm on the half-billion page English-language subset of the ClueWeb09 collection, and evaluated its effectiveness in detecting web spam by manually inspecting small samples of the detected quilted pages. This manual inspection guided us in iteratively refining the algorithm to be more efficient in detecting real-world spam.
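To make the distinction from near-duplicate content concrete, here is a toy Python sketch of the general idea. It is not the algorithm from the paper, which is not reproduced here. The function names, the sentence-level matching, and the `min_sources` and `min_fraction` thresholds are all assumptions made up for illustration. The sketch flags a page as "quilted" when most of its sentences also appear on other pages and those sentences come from at least two different pages; a straight copy of a single page has only one source and is not flagged.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive sentence splitter (an assumption; real systems use better ones)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Normalize whitespace and case so trivial differences do not hide a match.
    return [re.sub(r"\s+", " ", p).strip().lower() for p in parts if p.strip()]


def find_quilted_pages(pages, min_sources=2, min_fraction=0.8):
    """Return the ids of pages that look stitched together from other pages.

    pages: dict mapping a page id to its text.
    min_sources and min_fraction are hypothetical thresholds, not values
    taken from the paper.
    """
    sentences = {page_id: split_sentences(text) for page_id, text in pages.items()}

    # Inverted index: sentence -> set of page ids that contain it.
    index = defaultdict(set)
    for page_id, sents in sentences.items():
        for sentence in sents:
            index[sentence].add(page_id)

    quilted = set()
    for page_id, sents in sentences.items():
        if not sents:
            continue
        shared = 0
        sources = set()
        for sentence in sents:
            others = index[sentence] - {page_id}
            if others:
                shared += 1
                sources |= others
        # Mostly borrowed text, drawn from several distinct pages.
        if shared / len(sents) >= min_fraction and len(sources) >= min_sources:
            quilted.add(page_id)
    return quilted
```

This brute-force version only works on a handful of pages and is easy to fool, for example when a source page is itself copied in several places. It is meant to show why a page assembled from pieces of many pages is a different thing from a near copy of one page, not how detection is done at web scale.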

There is no doubt Google and other search engines are onto this type of behavior, but it is always nice to point to research papers when we can. Thanks, Dawn.

Forum discussion at Twitter.

 
