Latent Semantic Analysis (LSA) - Crawl into the Google Algorithm?

Feb 3, 2005 • 5:19 pm | comments (3) | Filed Under Google Search Engine
 

Earlier today, I started a little theory on the sandbox dying. Well, there is a ton of smart forum discussion going on in a thread at SEW Forums that I renamed to "Major Google Changes: Latent Semantic Analysis?". Now, bakedjack has been really driving the thread into a discussion on LSA.

First, let me quote some of randfish's post on what LSA is about:

The idea behind this is that by taking a huge composite (index) of millions of web pages, the search engines can "learn" which words are related and which noun concepts relate to one another.
For example, using LSA, a search engine would recognize that trips to the zoo often include viewing wildlife and animals, possibly as part of a tour.
Now, conduct a search at Google for ~zoo ~trips. Note the bolded words match the terms I italicized in the paragraph above. Google is bolding 'related' terms and recognizing which terms frequently occur concurrently (together / on the same page / in close proximity) in their index.
Some forms of LSA are too computationally expensive. For example, Google isn't smart enough to 'learn' the way some of the newer learning computers do at MIT (see some news reports on this). They cannot, for example, learn through their index that Zebras and Tigers are both examples of striped animals, although they may realize that stripes and zebra are more semantically connected than ducks and stripes.

Very well done.
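To make the co-occurrence idea in the quote a bit more concrete, here is a minimal sketch in Python. It is not LSA proper (which typically factors a term-document matrix) and it is certainly not what Google runs. It only scores two terms as related when they tend to appear on the same pages. The five "pages" and the names PAGES, term_vectors, cosine, and relatedness are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy "index": each string stands in for one web page.
PAGES = [
    "zoo trips include a tour to see wildlife and animals",
    "our zoo tour shows animals such as the zebra with its stripes",
    "the zebra is an animal with stripes",
    "ducks swim on the pond near the farm",
    "a farm pond with ducks and geese",
]

def term_vectors(pages):
    """Map each term to a Counter of {page_index: occurrences on that page}."""
    vectors = {}
    for i, page in enumerate(pages):
        for word in page.lower().split():
            vectors.setdefault(word, Counter())[i] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def relatedness(vectors, t1, t2):
    """Score how often two terms share pages (0.0 = never, 1.0 = always)."""
    return cosine(vectors.get(t1, Counter()), vectors.get(t2, Counter()))

vectors = term_vectors(PAGES)
zebra_stripes = relatedness(vectors, "stripes", "zebra")
ducks_stripes = relatedness(vectors, "stripes", "ducks")
print("stripes/zebra:", zebra_stripes)
print("stripes/ducks:", ducks_stripes)
```

On this toy data, "stripes" and "zebra" always share a page and "stripes" and "ducks" never do, so the first score comes out higher. Note what the sketch cannot do: nothing here would tell you that zebras and tigers are both striped animals unless the pages happened to say so, which is the limitation randfish is pointing at.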

I was chatting with Ammon Johns earlier today, and he said that a search engine can perform LSA in more than one way. Here are two: (1) the way Teoma does it, with hubs and communities, and (2) by looking at the words on a page and around the links, and seeing how they are related. Well, it's best explained by my coverage of the Super Session: History of SEO/SEM Theory and Testing - WMW Conf 7, where Daron Babin (aka SEGuru) was reported as saying, "He recommends writing a page of content and pulling out the keywords, then giving it to someone and asking them to figure out what the keyword is. He said it's about the other words on the page; it's that important. If the keyword is "apple", is the page about computers or fruit?"

The thread is digging into this further.

 

Comments:

Ammon Johns

02/04/2005 08:35 am

Hey Barry, I think we had some crossed wires earlier because all I was answering was what you quoted to me, namely: "writing a page of content and pulling out the keywords, then give it to someone and ask them to figure out what the keyword is. He said it's about the other words on the page, it's that important. If the keyword is "apple" is the page about computers or fruit?" Which I took as a kind of "Is this statement true or false?" type of deal and answered in terms of the simplest forms of semantic analysis. There's a lot more to LSI than that alone, and there's at least a couple of threads at Cre8asite that have touched upon it, and on how it actually affects things like cross-linking, filenames, etc. http://www.cre8asiteforums.com/search.php?search_keywords=LSI

Ammon Johns

02/04/2005 08:39 am

Oops, I forgot to mention that the first and most detailed thread on LSI alone was from December 2002, just a couple of months after Cre8asite opened, so you can see that this is a fairly established (though less well known) field of study in serious SEM circles. http://www.cre8asiteforums.com/viewtopic.php?t=593

Barry Schwartz

02/04/2005 01:16 pm

Thank you, Ammon. I did kind of take your words out of context. I also tend to oversimplify things. Thank you for leading the reader to a more comprehensive discussion on those topics. As always, it's an honor.
