Search Algorithms: The Patent Files

Aug 8, 2005 • 1:29 pm | comments (3) | Filed Under Search Engine Strategies 2005 San Jose
 

Chris Sherman opens the panel and welcomes everyone to the session. He explains that Rand Fishkin was not able to make the panel, so they are going to play an audio file over the presentation. The audio plays; it's a bit hard to hear and he speaks too fast. Rand Fishkin presents details about understanding Google's patent, Information Retrieval Based on Historical Data.

Rand goes on to explain that this data and information will help us rank our pages better. He starts with document inception dates and how they impact results, noting that this also applies to Yahoo and MSN. The first time a spider finds a page is one way a search engine can record the inception date; the first discovery of a link to the page or the registration date of the domain are other ways. His next slide is about how content changes affect rankings. He explains that insubstantial and cosmetic changes will be ignored. Links that remain after pages are updated may be considered more valuable, and Google may also positively weight changes to a page. Rand goes on to talk about temporal analysis of links affecting rankings. He says the search engines will measure new links as they appear to identify trends; for example, it may be odd that a page gains a lot of links in a short amount of time, and the search engines will make assumptions based on that link data. The patent document also mentions that Google may look at the freshness of link weights, such as the date of appearance, changes in anchor text, and changes to the page the link is on. He explains that the patent document also says additional weight may be given to links they trust more, such as .gov and .edu links.
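The temporal link analysis Rand describes, watching how fast new links appear and flagging an unusual spike, could be sketched roughly like this. The 30-day window and 50% threshold are purely illustrative assumptions; the patent gives no such numbers.

```python
from datetime import date

def link_growth_rate(link_dates, window_days=30):
    """Count how many inbound links first appeared in the most recent window."""
    if not link_dates:
        return 0
    latest = max(link_dates)
    return sum(1 for d in link_dates if (latest - d).days < window_days)

def looks_suspicious(link_dates, window_days=30, threshold=0.5):
    """Flag a page whose recent link gain dominates its entire link history."""
    recent = link_growth_rate(link_dates, window_days)
    total = len(link_dates)
    return total > 0 and recent / total > threshold
```

A page with six links gained over six months followed by twenty links in three weeks would trip this sketch's flag, while steady monthly growth would not.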

His next slide is about how Google may flag a site for spam, depending on the speed of link gain, the source of link gain, and so on. He quickly covers identifying doorway and throwaway domains: how long in advance was the domain paid for? DNS records, such as the name of the registrar, the technical contacts, and the addresses of the name servers, may be examined. Google also claims it will use lists of known bad actors, such as IP ranges. Google also considers domain ranking history; sites that jump in the rankings may be spamming, and commercial queries may have higher sensitivity to potential spam. Google may monitor the rate at which a site was selected over other sites, and how fast the site went to the top of the search rankings.

So what are the preventative measures against spam? Google may employ methods such as limiting the maximum achievable rank increase over a given period of time, and considering mentions of documents in news articles. Google also looks at traffic analysis: a large reduction in traffic may indicate that a result is "stale," and seasonality may be used to help determine relevance. The patent claims to measure "advertising traffic" for websites. A lot of this is speculation, and no one knows exactly how they collect some of this information. Google may also rely on user data, such as the number of times a page is selected from the SERPs and the relative amount of time users spend on a particular page or site.
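The traffic-analysis idea, where a large drop in visits suggests a "stale" result, could look something like this minimal sketch. The 50% drop threshold is my own assumption; the patent gives no figure.

```python
def staleness_signal(monthly_visits):
    """Flag a likely 'stale' result: the latest month's traffic is far below
    the average of the preceding months."""
    if len(monthly_visits) < 2:
        return False
    prior_avg = sum(monthly_visits[:-1]) / (len(monthly_visits) - 1)
    return prior_avg > 0 and monthly_visits[-1] < 0.5 * prior_avg
```

A real system would also have to account for the seasonality the patent mentions, so a raw month-over-month drop like this would produce false positives on seasonal sites.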

Dr. Garcia from MiIslita.com is up next to speak about patents on duplicated content. He starts with some disclaimers; the first is that a patent document does not imply implementation. Dr. Garcia discusses Google's patent Detecting Query-Specific Duplicate Documents and explains how it works: the engine takes a query, ranks candidate results by relevancy (A, B, C, D, E), runs them through query-specific filters, looks for duplicates, removes them, and finally shows the final set to the user (A, B, E). To find duplicate documents, Google first sends the document through linearization, then uses a 15 to 100 character sliding window. The idea is to shift the window over the text and calculate the term frequency in that area; there may be many sliding windows. The top two sliding windows are collected to define a query-relevant snippet for the corresponding document. He goes on to say the 2003 patent compares a current snippet with snippets already in the final set. His slide displays a list of ranked results: he says if a result is similar to result number 2 but not to number 1, then it will keep it. The patent document opens the door to other detection methods, such as standard IR similarity measures and shingles.
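The sliding-window step Dr. Garcia describes, shifting a window over the linearized text, scoring term frequency in each position, and keeping the top two windows as the query-relevant snippet, can be sketched as follows. The word-level tokenization and scoring details are my own assumptions; the patent's windows are characterized by character length.

```python
from collections import Counter

def sliding_window_tf(terms, size=15):
    """Yield (start_index, term-frequency Counter) for each window of `size` terms."""
    for i in range(max(1, len(terms) - size + 1)):
        yield i, Counter(terms[i:i + size])

def top_windows(text, query_terms, size=15, k=2):
    """Pick the k windows with the most query-term hits, as a stand-in snippet."""
    terms = text.lower().split()
    query = {t.lower() for t in query_terms}
    scored = [(sum(tf[t] for t in query), i) for i, tf in sliding_window_tf(terms, size)]
    scored.sort(reverse=True)
    return [" ".join(terms[i:i + size]) for _, i in scored[:k]]
```

The snippets this returns are what would then be compared against snippets already in the final result set.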

He next goes into more complicated math, relating to how they treat snippets as vectors and compute a cosine similarity. The idea is to represent the two snippets as points in a term space; from the two vectors they can get magnitudes and a dot product, which they use to measure similarity. The closer the cosine is to 1, the more similar the documents are at the point of comparison. They can set a threshold: if the cosine passes a certain point, they can decide to reject or accept the document, and retesting is also possible. He goes on to examine another way to compare resemblance, drawing on an AltaVista patent published in 2001. That approach takes two linearized documents and counts individual and common shingles (or windows); he gives the example of the phrase "a rose is a" from document A being compared against document B. His next slide covers using Jaccard's coefficient to compute the resemblance of the documents. This, I think, helps guard against false positives when using short shingles, such as unrelated documents that may look similar, and false negatives when using long shingles, where small changes produce a large impact.
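Both resemblance measures he walks through, cosine similarity over snippet term vectors and Jaccard's coefficient over shingles, can be sketched briefly. The shingle size of 4 (matching his "a rose is a" example) and the bag-of-words vectorization are illustrative choices, not values from the patents.

```python
import math
from collections import Counter

def cosine_similarity(snippet_a, snippet_b):
    """Cosine of the angle between two snippets' term-frequency vectors:
    dot product divided by the product of the magnitudes."""
    a, b = Counter(snippet_a.lower().split()), Counter(snippet_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def shingles(text, size=4):
    """Set of overlapping word n-grams ('shingles') from a linearized document."""
    terms = text.lower().split()
    return {tuple(terms[i:i + size]) for i in range(max(1, len(terms) - size + 1))}

def jaccard_resemblance(doc_a, doc_b, size=4):
    """Jaccard's coefficient: shared shingles over total distinct shingles."""
    a, b = shingles(doc_a, size), shingles(doc_b, size)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Identical snippets score 1.0 under either measure, so a filter could reject any candidate whose score against a snippet already in the final set exceeds some chosen threshold.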

He asks, is Google implementing the patent or not? He says he was curious to find out the length of snippets used; he found that Google uses a 15 term sliding window but a 30 term snippet. So if Google is not using this patent, then why should we care? He says copywriters can better understand the how, what, and why of search engine snippets, and optimizing snippets could improve SERP click-throughs. It can also encourage testing of snippet-based filters. Developers in particular can understand hierarchical clustering interfaces, and they can design snippet-based tools such as keyword suggestion tools, and so on. Great presentation.

Ani Kortikar from Netramind was up next. He starts off light, talking about his kids and show and tell; he says his kids like to play along. He wants to walk through a lot of ideas. His first slide says "Patentology" and predicting futures; he explains that people look at alternative methods of predicting the future. He says that when he looked at a lot of the patents, he saw search engines going through 4 stages of quality control. Interesting. The first is administrative control and editorial discretion: search engines want to know if you have administrative control, and if you do, you have a greater chance of being treated better. The next stage is usage: trends, cache, bookmarks. He explains that when you look at it, you see why they created the toolbars. He asks whether people really refer to older sites more, or vice versa? He says search engines are trying to figure this out. The next stage is the search experience; this stage has methods by which search engines can analyze the text. The final stage is intent: they want to know what you want to do, such as buy a home in Arizona or rent a home in Arizona. They want to know your intent.

He goes on to give another analogy for the situation, the Tortoise and the Hare. He gives the advice to have no surprises, such as sudden link growth, content growth, and structure changes. This advice might be overkill in my opinion. He goes on to talk about how much green you can get. He mentions individual patents, such as #1741 – historical data, 4873 – related documents, 9851 – creating hyperlinks, 9576 – direct navigation, 5259 – ranking by re-ranking, and 9499 – improving search quality.

He also looked at some patents by Yahoo. He says Google has more houses, but Yahoo has more cards. He mentions ones of interest, such as #3996 – affinity analysis (what is the percentage someone might use a related search), 0609 – extracting prices from HTML, 1372 – sales/revenue as search weight, 2259 – trend analysis, and 0108 – cookies to database. He says you probably already know this: do deliberate and methodical planning, provide multiple information streams, look at visitor retention analysis and planning, and watch your back. Ani did a good job with a very interesting presentation.

Finally up is Jon Glick from Become.com. He explains what a patent is, and whether or not we can trust one. Search engines know that their patents will be read by competitors and SEOs, and they author them accordingly. He goes over a couple of key disclosures, such as search engines possibly taking CTR into account as a ranking factor. He says CTR is a great indicator of relevancy, but it's easy to distort, so the search engines aren't too fond of this method. He gives the example of the Google smileys in the toolbar: he said he talked to someone at Google about it, who couldn't comment, but said that if a site gets a whole lot of smileys, they might have the spam people take a look at it. He goes on to talk about how the time spent on a site may also be a factor. They could use it to flag sites where users hit the back button almost immediately, or to boost rankings for final destinations, where users spend more time on the site.

He goes on to talk about the rate of change in links. Most search engines limit how quickly a site can gain connectivity, and a sudden jump in inbound links can draw scrutiny from spam cops, though there are exceptions for spike sites. He next talks about the rate of change in content. Search engines do keep a history of the site, and duplicate detection technologies are used to find meaningful changes in site content. When a site moves IP address, it is often re-evaluated: this could indicate new ownership or a change in parked status, and search engines don't like indexing parked domains. He closes by saying that all search engines tend to use similar tactics; the core of all search engine ranking remains great content and great connectivity.

Update: HTML and PDF summary is available. Search Engine Patents On Duplicated Content and Re-Ranking Methods (PDF)

Previous story: Mobile Search
 

Comments:

orion

08/18/2005 03:16 pm

HTML and PDF summary is available. For PDF see http://www.miislita.com/search-engine-conferences/duplicated-content-patents.pdf For HTML, just change the extension Orion

Barry Schwartz

08/18/2005 03:24 pm

Thanks Orion, I updated the main entry.

Akash Xavier

01/10/2007 05:20 am

Hi I developed a new content ranking algorithm similar to pagerank and trustrank.
