
Fighting Search Spam With PhraseRank: The Latest Google Patent Buzz

Jan 2, 2007 - 7:31 am, by Barry Schwartz

Filed Under Google Search Engine Optimization

The blogs have been discussing a new Google patent application named "Detecting spam documents in a phrase based information retrieval system." The abstract reads:

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.

Bill Slawski wrote up his excellent analysis of the patent application at SEO By The Sea and then coined the term PhraseRank at Search Engine Land.

Bill asks, "Is Google using a process like this?" and answers, "It’s possible." The key passage from the patent application reads:

From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
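The deviation test described in that passage can be sketched roughly as follows. This is only an illustration, not Google's implementation: the function name `flag_spam`, the z-score test, the threshold of 3.0, and the sample counts are all assumptions made for the example. The patent application only says that spam documents show a "statistically significant deviation" in their number of related phrases.

```python
from statistics import mean, stdev

def flag_spam(baseline_counts, candidate_counts, z_threshold=3.0):
    """Flag documents whose related-phrase count is far above the baseline.

    baseline_counts: related-phrase counts for documents assumed to be non-spam.
    candidate_counts: dict mapping a document id to its related-phrase count.
    z_threshold: hypothetical cutoff; the patent application does not give one.
    """
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)  # sample standard deviation, needs >= 2 values
    flags = {}
    for doc_id, count in candidate_counts.items():
        if sigma == 0:
            # Degenerate baseline: any count above the mean counts as a deviation.
            flags[doc_id] = count > mu
        else:
            z_score = (count - mu) / sigma
            flags[doc_id] = z_score > z_threshold
    return flags

# Made-up baseline in the 8 to 20 range the patent application calls typical.
baseline = [8, 10, 12, 15, 20, 9, 14, 11]

# Made-up candidates: one ordinary page, one with hundreds of related phrases.
candidates = {"normal-page": 13, "scraped-page": 250}

flags = flag_spam(baseline, candidates)
for doc_id, is_spam in flags.items():
    print(doc_id, "spam" if is_spam else "ok")
```

Only unusually high counts are flagged here, since the passage describes spam as having an excessive number of related phrases. As tedster notes below, the result depends entirely on how good the non-spam baseline is.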

In the WebmasterWorld thread, tedster believes this is meant to target auto-generated spam pages, which was also my initial thought after reading the abstract:

This approach seems to me to be aimed at autogenerated pages, constructed from scraped bits and pieces to attract a long tail search to a page with ads. Of course, it does all hang on the base measures of assumed non-spam documents, but I assume Google has enough data to take a decent baseline measure.

Forum discussion at WebmasterWorld.


The content at the Search Engine Roundtable are the sole opinion of the authors and in no way reflect views of RustyBrick ®, Inc
Copyright © 1994-2026 RustyBrick ®, Inc. Web Development All Rights Reserved.
This work by Search Engine Roundtable is licensed under a Creative Commons Attribution 3.0 United States License. Creative Commons License and YouTube videos under YouTube's ToS.
