Search Algorithm Research & Developments

Feb 28, 2005 • 4:38 pm | Filed Under Search Engine Strategies 2005 New York
 

Orion couldn't make it due to a mudslide; he will be here tomorrow. They will try to present his presentation with voice-overs.

Mike Grehan was up first. He deleted his presentation last night and had to restart from scratch, but everyone sympathized. He shows the SEW Forums and explains that people are very interested in "this stuff." He highlights the keyword co-occurrence thread, which had 46,401 views, so there is a lot of interest. He said Orion deserves a ton of credit.

"What are the ages of my three sons?" He starts a story in which all three of his sons are having a birthday today. He then gives clues to figure out his sons' ages: the product of the ages is 36, the sum of their ages is equal to the number of windows in the building, and the last clue is that one son has blue eyes. Mike then gives a breakdown of how to figure it out with equations. The answer is 9, 2, and 2. Both 9, 2, 2 and 6, 6, 1 multiply to 36 and add up to 13, so the sum alone can't settle it. The last clue, about the blue eyes, said there was an oldest son, so 6, 6, and 1 wouldn't be the right answer.

He explains that engines want the most relevant results, which is hard "because end users are search nitwits!" He explained that if you walk into a travel store and just tell the clerk "travel," he will kick you out, but search engines respond. Then there is the "abundance" problem: too many results, so which are the best results, and which are the most relevant?

Social networks were extensively researched long before the Web. He briefly explains "citation analysis," so we have a Web graph with directed edges and undirected edges (co-citation). If you have questions about this, let me know. Then he discusses PageRank and HITS. He sums it up this way: PageRank is keyword independent, while HITS (Teoma) is keyword dependent. Great way of explaining the difference. He says there is only one problem with these two solutions: "Neither of them work." As for the problem with PageRank, he said, well, they don't use it, so he skipped it. He then went on to HITS and said topic drift, nepotistic linking, and runtime analysis are the three issues. The first two were corrected, but runtime analysis is still an issue. He explained how AG from Ask Jeeves (Teoma) cracked it. He then put up a graph on the hubs and authorities.

So what happened next? The B&H algorithm died with AV, then those two went to Google and Hilltop came out. Then in Feb. 03, Google patented Local Link (Bharat). Then he went into Florida (nice little graph); he said it had a lot to do with Google moving from keyword independent to keyword dependent. He throws up some links to advanced papers about the future.

He finishes off his presentation with another story. A guy is walking in a desert and finds a dead guy on the sand with a bag on his back. What was in his bag? A parachute.
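Going back to Mike's three-sons puzzle for a moment: I didn't capture his equations, but the reasoning can be checked by brute force. Here is a minimal sketch in Python. The function name `solve_three_sons` is my own, and I'm assuming the sum clue only matters because it leaves more than one candidate, and that the blue-eyes clue means there is a single oldest son.

```python
from collections import defaultdict
from itertools import combinations_with_replacement


def solve_three_sons(product=36):
    """Return the age triples consistent with all three clues."""
    # Clue 1: every (youngest, middle, oldest) triple whose product matches.
    triples = [
        t
        for t in combinations_with_replacement(range(1, product + 1), 3)
        if t[0] * t[1] * t[2] == product
    ]

    # Clue 2: the sum is known but still not enough, so the real sum
    # must be one that several triples share.
    by_sum = defaultdict(list)
    for t in triples:
        by_sum[sum(t)].append(t)
    ambiguous = [t for group in by_sum.values() if len(group) > 1 for t in group]

    # Clue 3 (assumption): there is exactly one oldest son.
    return [t for t in ambiguous if t[2] > t[1]]


print(solve_three_sons())  # [(2, 2, 9)]
```

The only sum shared by two triples is 13, which is why the third clue is needed at all.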

Next up was Rahul Lahiri from Ask Jeeves; he helped me out with a relevancy issue a month ago. He said there is some overlap with Mike's presentation. He goes over the Ask properties and growth numbers. Ask's mission is relevance, index completeness, freshness, and structured data (smart answers). The algorithmic drivers are content/text analysis and link analysis. He focuses on the link side and shows a graph of page a linking to page b and page c (Mike showed something similar). Ask looks at what the "links are about." He goes into the hubs and authorities thing. The key challenges are solving the problem in real time and identifying the communities. He then gives examples of queries such as "buffalo" vs. "bay area airports". They need to weed out the noise from the good stuff. He explains that small enthusiast sites get a chance to rise to the top, which is great. They can then do a better job of identifying different communities and refining the search.
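For anyone who hasn't seen the hubs and authorities idea that both Mike and Rahul referred to, here is a rough sketch of the textbook HITS iteration. This is not Ask Jeeves' or Teoma's actual algorithm, which wasn't shown in that detail. The function `hits` and the tiny link graph are made up for illustration: page a links to b and c as on the slides, and the extra page d is my own addition.

```python
def hits(links, iterations=50):
    """Textbook HITS on a dict mapping each page to the pages it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]

        # A page's hub score is the sum of the authority scores it links to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}

        # Normalize both vectors so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}

    return hub, auth


# Hypothetical graph: a -> b, a -> c (as on the slide), plus d -> b (made up).
hub, auth = hits({"a": ["b", "c"], "d": ["b"]})
print(sorted(auth, key=auth.get, reverse=True)[:2])  # ['b', 'c']
```

In the original formulation the graph is built around the result set for a given query, which is where the keyword dependence comes from, and also why doing it at query time is expensive.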

Now they give Orion's (Dr. E. Garcia's) presentation a try. It sounds like Nacho. Cool, it's working. Nacho introduces it. Co-occurrence suggests association or relatedness. I'll summarize it later; it's very technical.

UPDATE: First, excuse me if I make major mistakes in my interpretation of the presentation. I hope Dr. G. (Orion) reviews this and makes any necessary corrections.

Orion's first slide went over some of the basics of co-occurrence. Orion explains that co-occurrence shows a type of "relatedness" between words. So if you have two terms that are often discussed or found in the same document, they tend to be more related. He then gives an example of the term "aloha". What does aloha make us think of? Hawaii is the correct answer. Orion then explains that this is important when conducting "keyword-brand associations."

In Orion's second example he shows an equation he discussed in the forums: c12-index = (n12 / (n1 + n2 - n12)) x 1000. He overlays an example of a k1 and k2 showing the n12 overlap in the middle, and explains how an example of 3 keywords makes for a much more complex query in AND mode (n123). He then brings back the old example of "aloha hawaii" to explain "term associations". When you compute the values in Google for "aloha hawaii" versus "aloha indiana" or "aloha montana", you will notice that the c-index is much higher for "aloha hawaii" (28.11) than for "aloha indiana" (3.23). This shows that aloha AND hawaii are more "semantically connected" than the other examples.

He then shows how you can use the c-index computation to determine in which engines it would be easier to target a specific keyword phrase: the higher the c-index, the more competitive that keyword phrase is in the engine, relative to other engines. Orion then explains that the c-index can be used to monitor keyword trends over time, and showed some very interesting slides to prove it. Orion's benchmark for a "competitive query" is one that has a c-index above 25 points; he lists a number of those submitted to him via SEW Forums for a study he did several months ago. He then computed the c-index of some spam-related keywords that were way above the 100 mark on the scale. Neat stuff.

Orion then explains that most engines use AND (FINDALL) mode as opposed to EXACT. When you compare both, you should find the results for EXACT mode within the FINDALL results. The reason has something to do with order and proximity: in EXACT mode they matter, and in FINDALL mode they do not. Using this information, Orion defined a new ratio named the "EF Ratio", which is equal to (n12 Exact Results / n12 FindALL Results) x 100. What the EF Ratio shows us is the "natural sequences" of words used, meaning how words are used in language and documents (real life). EF Ratios can be used to determine the competitiveness of a keyword. The lower the number, the less competitive it is. In fact, he showed that competitiveness for the same keyword phrases differs from search engine to search engine. The last slide we will save for those who were at the session.
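To make the two formulas concrete, here is a small Python sketch. The function names are mine, and the result counts are made-up numbers, not figures from Orion's slides. I'm assuming n1 and n2 are the result counts for each keyword on its own and n12 is the count for both together in AND mode, which is how I read the overlap diagram. Seen that way, the c-index is the overlap divided by the union of the two result sets, scaled by 1000.

```python
COMPETITIVE_THRESHOLD = 25  # Orion's benchmark for a "competitive query"


def c_index(n1, n2, n12):
    """c12-index = (n12 / (n1 + n2 - n12)) x 1000."""
    union = n1 + n2 - n12
    if union <= 0:
        raise ValueError("n1 + n2 - n12 must be positive")
    return n12 / union * 1000


def ef_ratio(exact_results, findall_results):
    """EF Ratio = (n12 Exact Results / n12 FindALL Results) x 100."""
    if findall_results <= 0:
        raise ValueError("FINDALL result count must be positive")
    return exact_results / findall_results * 100


# Hypothetical counts, for illustration only.
n1, n2, n12 = 1000, 500, 100
score = c_index(n1, n2, n12)
print(round(score, 2))                # 71.43
print(score > COMPETITIVE_THRESHOLD)  # True
print(ef_ratio(25, 100))              # 25.0
```

To compare engines the way Orion described, you would plug in the counts each engine reports for the same queries and compare the resulting numbers side by side.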

Q & A:

LSI - Mike said that engines will use it, but he implied they are not at this time.
