Fighting Search Spam With PhraseRank: The Latest Google Patent Buzz

Jan 2, 2007 • 7:31 am | comments (6) | Filed Under Google Search Engine Optimization
 

The blogs have been discussing a new Google patent application titled "Detecting spam documents in a phrase based information retrieval system." The abstract reads:

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are the indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.

Bill Slawski wrote up his excellent analysis of the patent application at SEO By The Sea and then coined the term PhraseRank at Search Engine Land.

Bill asks, "Is Google using a process like this?" and answers, "It’s possible." The patent application describes the detection step as follows:

From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
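To make the idea in that excerpt concrete, here is a minimal Python sketch of flagging documents whose related-phrase count deviates sharply from a baseline. The excerpt does not say which statistical test is used, so the z-score, the threshold of 3 standard deviations, the function name flag_spam, and the sample counts are all assumptions for illustration, not the patent's method.

```python
from statistics import mean, stdev

def flag_spam(related_phrase_counts, baseline_counts, z_threshold=3.0):
    """Return the indices of documents with an unusually high related-phrase count.

    baseline_counts: related-phrase counts from documents assumed to be non-spam.
    z_threshold: hypothetical cutoff; the patent excerpt gives no specific value.
    """
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot compute a z-score")
    flagged = []
    for i, count in enumerate(related_phrase_counts):
        z = (count - mu) / sigma  # how many standard deviations above the baseline mean
        if z > z_threshold:
            flagged.append(i)
    return flagged

# Illustrative numbers only: the baseline sits in the 8-20 range the excerpt
# calls typical, and one candidate falls in the 100-1000 "spam" range.
baseline = [8, 10, 12, 14, 15, 17, 20]
candidates = [11, 19, 250]
print(flag_spam(candidates, baseline))
```

As tedster notes in the thread, everything in an approach like this hangs on how good the baseline of assumed non-spam documents is.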

In the WebmasterWorld thread, tedster believes this is aimed at auto-generated spam pages, which was also my initial thought after reading the abstract.

This approach seems to me to be aimed at autogenerated pages, constructed from scraped bits and pieces to attract a long tail search to a page with ads. Of course, it does all hang on the base measures of assumed non-spam documents, but I assume Google has enough data to take a decent baseline measure.

Forum discussion at WebmasterWorld.

 

Comments:

Chris Beasley

01/02/2007 07:43 pm

I wish people would stop putting "Rank" after words found in Google patents. This is even stupider than normal because it is an anti-spam patent, not a ranking-related one. I realize that people like to seem smart by making up things about search engines, but all it does is serve to confuse the masses and in the end make the whole industry look shady. So let's not make up any more unrelated words that end in "Rank" to ride the coattails of PageRank, and remember that companies patent every idea they have, regardless of whether they use it or not. The road to hell is paved with assumptions.

Barry Schwartz

01/02/2007 08:23 pm

Wow, bad day Chris?

Bill

01/03/2007 04:46 am

Chris, it's first and foremost a reranking algorithm discussed over 6 (at this point) patent applications, and secondly an anti-spam algorithm. The anti-spam aspect of it is a minor part. The "phraserank" terminology was not my idea, but at the time, it didn't seem like something that would threaten the integrity of the industry. Still doesn't. :)

algoholic

01/04/2007 05:01 am

Chris, this is just a terminology issue. The algo uses Phrase parameters to get spam away from SERPs and re-Ranks the results. Isn't this Phrase+Rank related?

Bill

01/04/2007 05:48 am

[0190] The search system 120 provides a ranking stage 604 in which the documents in the search results are ranked, using the phrase information in each document's related phrase bit vector, and the cluster bit vector for the query phrases. This approach ranks documents according to the phrases that are contained in the document, or informally "body hits."

- Phrase-based indexing in an information retrieval system: http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220060020607%22.PGNR.&OS=DN/20060020607&RS=DN/20060020607

Garry

01/09/2007 08:34 am

Well, sounds fine, but then they will move over to picture format content. I feel that Microsoft can do a lot more than it does with Google secondary. Detect the host nodes and see if they can be disconnected from the NET as punishment. The owners know if spam is being sent from their site.
