Search Algorithms and Research

Feb 28, 2005 • 8:17 pm | Filed Under Search Engine Strategies 2005 New York
 

“End users want to achieve their goals with a minimum of cognitive load and a maximum of enjoyment.” ~ Marchionini. Why? Because search users are nitwits. Mike Grehan asks us to consider the following: what if someone goes into a travel store and, when asked what he is looking for, answers “travel”? He goes on to describe what it takes to get ranked in the top ten. Social science and bibliometrics are also mentioned on the screen; both have existed for a long time, well before search engines, and are being applied today in the algorithms created for search engines. The web is a social network, he continues, and social networks were extensively researched long before the web. He describes citation analysis and how it is applied in search engines. There is a difference between a citation and a reference.

Hyperlink analysis algorithms make either one or both of these simple assumptions. Assumption 1: a hyperlink from page A to page B is a recommendation of page B by the author of page A. Co-citation: if a page C cites pages A and B, then A and B are said to be co-cited by C. Pages A and B being co-cited by many other pages is evidence that A and B are related. There are two main algorithms based on links. PageRank (Google): each page on the web has a measure of prestige that is independent of any information need or query, i.e. keyword independent. Roughly speaking, the prestige of a page is proportional to the sum of the prestige scores of the pages linking to it. The other is HITS, or Hyperlink-Induced Topic Search. The problem is that neither of these algorithms works.
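To make the two link algorithms a little more concrete, here is a minimal Python sketch of both ideas on a toy link graph. It is not what Google or anyone else actually runs. The graph, the function names, the damping factor of 0.85, and the iteration counts are all assumptions for illustration. The HITS function follows the usual hub/authority formulation and, for simplicity, runs over the whole toy graph, whereas HITS as normally described runs on a subgraph built around the query results.

```python
def all_pages(links):
    """Collect every page that appears as a source or a target."""
    return sorted(set(links) | {t for targets in links.values() for t in targets})


def pagerank(links, damping=0.85, iterations=50):
    """Query-independent prestige: a page inherits rank from the pages linking to it."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outlinks: spread its rank over every page.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def hits(links, iterations=50):
    """Hub and authority scores: good hubs point to good authorities."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for page, targets in links.items():
            for t in targets:
                auth[t] += hub[page]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: A and B both link to C, and C links back to A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
hubs, authorities = hits(toy_links)
```

In this toy graph C collects links from both A and B, so it ends up with the highest PageRank and the highest authority score, while B, which nobody links to, ends up at the bottom.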

The problems with HITS: topic drift, nepotistic linking, and runtime analysis. Mike says there are three steps to success. They cracked the runtime problem, taking a search from 11 seconds to instant. He describes Teoma and subject-specific popularity. Adventures in search algorithms: what happened next? Both Krishna Bharat and Monika Henzinger joined Google. Mike believes that the Florida update is where Google moved from keyword independent to keyword dependent. Ending joke: a guy is trapped in the desert and is looking for life. He finds a man face down in the sand with a bag on his back. He wonders what was in the bag that would have saved him. Answer: a parachute.

Next up was Rahul Lahiri, who presents some of the properties that Ask Jeeves controls. Today they are ranked #7 on the web and have done exceedingly well since this time last year. What is their mission? Relevance. He goes into general link analysis methods. The challenge is discovering what the links are about. A link from page A to page B (or C) is a vote or recommendation by the author of page A for page B (or C). The problem is that if you have a link with the anchor text “budget”, you don’t know what “budget” means. Is it Budget rent-a-car or someone’s company budget? That’s a problem, obviously. He continues that the answer is organizing sites into local subject communities. This is how Teoma views the web. One of the challenges they face is solving the problem in real time: they have 200 ms (milliseconds) to do this computation for each query, millions of times per day. You also have to identify the communities, and the link structure of the web is noisy. Hubs link to topic-specific pages. He gives an example of topic focused vs. broad topic areas: topic focused is a search for “buffalo” and broad topic areas is a search for “bay area airports”. One of the benefits is that smaller enthusiast sites get a chance to come up to the top of the search listings (example search: fantasy football). The power of communities is a better vision, expert validation, contextualization, and a better user experience.

Next to present was Dr. E. Garcia, a pioneer who has helped us as marketers better understand the search engines. His plane has been delayed till tomorrow because of weather (it’s snowing heavily here), BUT there is a voice-over for his presentation. The tape starts. He is going to discuss grasping co-occurrence. Co-occurrence suggests association or relatedness. Side note: people are leaving because the audio isn’t too great, but not too many, as there is a good amount of interest in this. Back to co-occurrence. Co-occurrence can be global, local, or fractal. This presentation is highly technical, and while I understand his work, it’s hard to follow. I am trying to get what I can, as it’s requiring very detailed listening and comprehension at this point. I apologize for any errors in this document.

He gives the example of “Hawaii”, which is semantically connected to aloha, Hawaiian, and Maui. C-indices can be used to estimate the relative presence of targeted keywords across search engines. Another example is “comida + mexicana”, two terms that are semantically connected. C-indices can also be used to monitor keyword trends, word patterns, and topics over time. He goes on to talk about competitive words. Based on his research, the examples suggest that many competitive queries in Google tend to exhibit c12 indices. His research indicates that overused queries tend to exhibit unusually high C-indices, while unrelated terms in a query tend to exhibit very small C-indices. He gives the example of “guacamole optimization”, with a low C-index of 0.12. On to term sequencing: EF-ratios. He talks about various types of queries, such as FINDALL and EXACT, and how order and frequency matter. EF-ratios can be used to estimate the relative frequency of natural sequences and phrases in a source. So what about candidate sequences? EF-ratios can be used to examine how easy or difficult it would be to rank for a given sequence in a given search engine. Keyword competitiveness is specific to each search engine. Some search engines return documents in which the sequence can be found. When queried in EXACT mode, some search engines return docs in which the queried terms can be found separated by a delimiter (hyphen, underscore), a space, or stopwords (in, of, with). So to recap, co-occurrence theory can be used to understand semantic associations between terms, products, and services.
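For readers who want something to play with, here is a rough Python sketch of the two measures. The exact formulas were not captured in these notes, so both definitions below are assumptions: the C-index is written as a Jaccard-style ratio (documents containing both terms divided by documents containing either), and the EF-ratio as the EXACT-mode result count divided by the FINDALL-mode result count. The tiny document list and all counts are made up, and the scale may not match the values quoted in the talk.

```python
def c_index(n1, n2, n12):
    """Assumed Jaccard-style co-occurrence index.

    n1, n2: number of documents containing term 1 and term 2.
    n12: number of documents containing both terms.
    """
    either = n1 + n2 - n12
    if either <= 0:
        return 0.0
    return n12 / either


def ef_ratio(exact_count, findall_count):
    """Assumed EF-ratio: EXACT-mode results divided by FINDALL-mode results."""
    if findall_count <= 0:
        return 0.0
    return exact_count / findall_count


def doc_count(docs, *terms):
    """Count documents that contain every one of the given terms as whole words."""
    return sum(all(t in d.lower().split() for t in terms) for d in docs)


def c_index_from_docs(docs, term1, term2):
    """Compute the assumed C-index for two terms over a small document list."""
    return c_index(
        doc_count(docs, term1),
        doc_count(docs, term2),
        doc_count(docs, term1, term2),
    )


# Hypothetical mini corpus, only for illustration.
sample_docs = [
    "aloha from hawaii",
    "hawaii maui beaches",
    "guacamole recipe",
    "seo optimization tips",
]

related = c_index_from_docs(sample_docs, "hawaii", "maui")
unrelated = c_index_from_docs(sample_docs, "guacamole", "optimization")
phrase_strength = ef_ratio(exact_count=50, findall_count=200)
```

With a real search engine, the counts would come from the result totals reported for each query rather than from a local list, which is presumably how the figures in the talk were obtained.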

Q: Interested in how we will be searching in 5-10 years’ time? Personalization? A: Where is search going? Mike did an interview with the founder of Teoma. It was interesting, he says. The most interesting part was that the founder said they need to get 10 steps up the ladder, and currently we are at 3-4. The one thing that will change this will be personalization. Personalization is misunderstood. It’s not about giving you a search just for you. It’s about returning results for your peer group. From there they can start to tailor the search specifically to you. There is work now using genetic algorithms and other techniques to create search engines. Mike concludes that the more information we give the search engines, the better our experience will be.

Comments:

Aaron Rubin

03/01/2005 02:36 am

Mike Grehan mentioned several advanced algorithms, and the slides had URLs for them. Anyone know where I can find the links? I hadn't known Google had an "Ad Automator." With the API it's no longer much of a time savings, but if you are interested, I guess talk to Google at the expo or shoot them an email. Dr. Garcia's algorithms were proposed as a method of measuring keyword competitiveness, and they should provide a far more complete picture than the current best-guess methods of measuring competitiveness. I wonder if someone will write a search term competitiveness tool using his algos. It would be better than what I've seen so far. It also obviously can be used by SEs as well (but he wasn't presenting to the SEs...). Remember all the talk that Google was applying filters on "commercial/competitive/highly SEOed/high adwords revenue" terms? C-index or similar would be one great way for Google to do exactly that. I'm far too lazy to retrieve pre- and post-Florida data to test the hypothesis. The contrast between the generalities and age-old info Mike Grehan and Rahul Lahiri presented and the concrete formulas of Dr. Garcia was a bit jarring. The fact that Dr. Garcia wasn't there didn't help the transition either, but it was informative.

Barry Schwartz

03/01/2005 03:05 am

Trust me. Dr. Garcia did everything in his power to be there. He was looking forward to it and is very upset about the whole situation. He spent the whole day, along with Nacho, adding voice overs to his presentation to present something. A lot more work on his part.
