Search Technology Archives

Did Google Trends Predict the Super Tuesday Results?

Yesterday was Super Tuesday in the United States, when many of the presidential primaries were held. SEO.com has written up a post that uses Google Trends data for each state to see whether search interest pointed to the winners. Judging by that data, it seems plausible that people's searches reflected how they went on to vote.

Of course, not everything is perfect using Google Trends, as pointed out by DigitalPoint Forums members. However, it's interesting to see how many results correlated with the trending data.

I can imagine that other competitive intelligence firms will come out with results later today or tomorrow that will measure the results even more accurately.

Or will they?

Forum discussion continues at DigitalPoint Forums.

posted Tamar Weinberg in Search Technology at February 6, 2008 9:49 AM Comments (2)

Who Is Better At Finding Duplicate Content?

So who likes finding duplicate content more? Or who has the most to gain by finding duplicate content - Google or Copyscape? The forum members on Digitalpoint are having an interesting conversation on the differences in the ways Google and Copyscape find duplicate content. Most agree the two use different algorithms, but how hard can it be? Remarkably, the consensus is that each company is unique in the way it finds duplicate content.

WebmasterWorld is also having some discussion on duplicate content. Namely, how do you deal with heavily copied content? Some of the members have some excellent advice on dealing with duplicate content these days.

1. Don't Go After the Content Scrapers


I rarely go directly after the copiers these days - instead I focus on strengthening the website itself. It's harder for copied content to beat a strong website - but from time to time it happens.

2. Go After the Web Hosting Company Instead

First, file DMCA notices against all US-based webhosts or server companies involved. Personally, I skip informing the webmaster first as he might not be US-based and it seldom works. Hosts and server companies normally take down whole sites or even servers (not just individual infringing pages) meaning potentially crippling losses for the webmaster renting a dedicated server from which he runs multiple scrapers.

3. Strengthen Your Website

The older a site is and the more pages and more trust it gains (along with measures to help deter scraping like using full urls, etc.) the less likely that having scraped content will cause any harm.

4. Don't Abandon The Original Content

Second, if you do re-write, don't abandon your original content. There's obviously a market for it, so arrange for it to be used on other sites, by agreement and with appropriate links.

5. Insert Your Website or Company Name Into The Content

I try and generally include a link or two in my content to another content page in my site, since most people copying do so by way of automation and pick up your link.

6. Consider That Eventually Someone Will Steal Your Content

i've personally given up on any rights on anything in any matter on the internet - I don't publish anything or put anything on the internet which I don't want to be re-distributed on a massive scale, be edited, laughed at, cried about, never quoted

Continued discussion at Digitalpoint and WebmasterWorld

posted Phoenix in Search Technology at October 4, 2007 12:38 PM Comments (0)

Google Campaign Optimizer – A Friend or Foe?

Google AdWords and other paid search marketing (PPC or pay-per-click) platforms including Yahoo! Search Marketing and MSN AdCenter provide very robust administration platforms to allow for a variety of options to be specified when serving ads to searchers and contextual partner sites. Just over the past two years, these systems have become increasingly capable of allowing advertisers to focus their budgets towards keyword phrases that are providing return on their advertising spend. Even with all these capabilities, a lack of testing, research and/or understanding of past performance can make campaigns inefficient.

Google offers one particular option to its AdWords advertisers called “campaign optimizer.” In the past, this was an option that could be found in the campaign level settings page; however, Google seems to be pushing its “enhancements” more often within the main dashboard. A recent thread started at WebmasterWorld forums describes this situation, as a user relates seeing messages within his dashboard to the effect of running the campaign optimizer in order to “increase traffic by 17%.”

Is this the right tactic to use? Should advertisers allow Google to enhance their campaigns’ performance? In my opinion it depends on the account. If the owner of the account has the time to run tests and optimize the campaign by themselves, they will likely do at least as well as or better than Google, without having to fear that in Google’s eyes “optimization = spending the entire daily budget each day.”

Join the discussion at WebmasterWorld forums.

posted chrisboggs in Search Technology at October 4, 2007 8:33 AM Comments (0)

Is PageRank Juice the Only Value of a Link?

A Cre8asiteforums discussion called The Divide Between Search Engines And Seo's - "No Follow" Fiasco points to two somewhat emotional discussions elsewhere on the possible ramifications or practices of using the "rel=nofollow" tag in links.

In one case, the US Federal Trade Commission enters the arena to stir up the pot for paid links.

In another case, a blog directory has been accused of not passing "link juice" to the blogs that have submitted to it and of using JavaScript "onclick" code in its URLs.

The thread at Cre8asiteforums points to both discussions and members returned to voice their opinions. Both the Sphinn and SEOFastStart discussions provide a chance to learn more, regardless of who is right or wrong. I was voted completely out into the universe in Sphinn for remarking that PR can't possibly be the "only" reason people submit to blog directories.

Apparently, I'm terribly wrong about that.

posted cre8pc in Search Technology at September 13, 2007 12:07 PM Comments (3)

Getting Users to Bookmark Your Site: Traditional Bookmarking vs. Social Bookmarking

Traditional bookmarking seems obsolete. Adding a bookmark to your browser is, to many, a practice that has since been replaced by newer methods -- social bookmarking sites, if you will.

However, not everyone is aware of these social bookmarking sites, nor are they ready to abandon their traditional methods of bookmarking. A Cre8asite Forums thread touches upon this subject. In the thread, administrator EGOL suggests that traditional means of bookmarking stay intact, and social bookmarking methods through sites like AddThis.com not necessarily be implemented -- or at least done as a secondary option.

This is exactly what other members agree is the right thing to do:

Absolutely. For most visitors bookmark this site means triggering a bookmark in their browser. I would add social bookmarking micro-icons for the rest. It also comes out as more honest.
The other angle is whether to have social bookmarking sites only, browser bookmarking only, both, or simply some text saying "hey, press Control-D to bookmark this!". IMHO, you need at the very least the Ctrl-D text and some of the social bookmarking sites.

For those not ready to jump into the social bookmarking realm, you should make sure that if you include a bookmarking option, your website accommodates these types of users.

Discussion continues at Cre8asite Forums.

This article was written this past Monday and scheduled for publication on Wednesday, May 23rd.

posted Tamar Weinberg in Web Promotion at May 23, 2007 9:27 AM Comments (1)

75% of Google's Blogspot Blogs are Spam

On a recurring theme of Internet spam, a study discussed in WebmasterWorld indicates that three out of four Blogspot blogs that appeared in its search results -- or 75% -- are spam.

According to the study (PDF link):

...14 of the top-15 doorway domains have a spam percentage higher than 74%; that is, 3 out of 4 unique URLs on these domains (that appeared in our search results) were detected as spam. To demonstrate the need for scrutinizing these sites, we scanned the top-1000 results from two queries – “site:blogspot.com phentermine” and “site:hometown.aol.com ringtone” – and identified more than half of the URLs as spam easily.

Here is a chart from the study showing the "top doorway domains and their spam percentages (among the search results in our data)":
[Chart: top doorway domains and their spam percentages]

The suspected reason for this is that the popular blogging service is free. One WebmasterWorld member states:

The trouble is, there's no algorithm that can automatically factor in the price of a service. It's free to set up a blog on Blogger, so it can be abused more easily. If these spammers actually had to pay for a new domain name every time they set up a splog, they wouldn't bother.

Other findings of this research showed the spam percentages for Top-Level Domains (TLDs):

  • 68% of .info TLDs are spam
  • 53% of .biz TLDs are spam
  • 12% of .net TLDs are spam
  • 11% of .org TLDs are spam
  • 4.1% of .com TLDs are spam


Forum discussion continues at WebmasterWorld.

posted Tamar Weinberg in Spam at March 20, 2007 9:43 AM Comments (13)

Yahoo! & Microsoft Release Papers on Web Spam

A WebmasterWorld thread links to a December 2006 paper at Yahoo! Research named A Reference Collection for Web Spam. The paper can be downloaded as a PDF file; it is not brand new, but it is relatively recent. Here is the abstract:

We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if the hosts are include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labeled by a large and diverse set of judges.

Gary Price of ResourceShelf linked to an updated paper from Microsoft on Web spam. The 10-page PDF file is named "Spam Double-Funnel: Connecting Web Spammers with Advertisers." Here is the abstract:

Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords – one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

So here is your weekend reading.

Forum discussion at WebmasterWorld.

posted rustybrick in Spam at March 16, 2007 7:41 AM Comments (0)

How Do Search Engine Robots Work?

I have always had a thing for spiders. Not the creepy crawly kind, but the kind made of bits and bytes that scour the web for new documents to download and index. They are so predictable, yet they can surprise you when you least expect it. How the hell did they do that or find that page? Many a webmaster has scratched their head in disbelief at a crawler at one time or another. There is a thread on WebmasterWorld asking new questions about how search engine crawling technology works and about the bare-bones infrastructure that takes a search engine from finding a page to ultimately deciding to list it in its search results. The thread covers the nuts and bolts of the technology and also updates previously known information with new questions and answers.

So how do search engine robots work and what comprises them?


Spider: a robotic, browser-like program that downloads webpages.
Crawler: a wandering spider that automatically follows links found on pages.
Indexer: a blender-like program that dissects webpages that are downloaded by spiders.
The Database: a warehouse of the pages downloaded and processed.
Search Engine Results Engine: digs search results out of the database.
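To make those five pieces a little more concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: the "web" is a hypothetical in-memory dictionary (PAGES) rather than real HTTP fetches, the "database" is a plain word-to-URL dictionary, and none of the function names come from the thread or from any real search engine.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real spider would fetch over HTTP.
PAGES = {
    "/a": '<a href="/b">b</a> spiders crawl pages',
    "/b": '<a href="/a">a</a> <a href="/c">c</a> indexers dissect pages',
    "/c": "results come from the database",
}


class LinkAndTextParser(HTMLParser):
    """Indexer helper: pulls links and lowercase words out of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for key, value in attrs if key == "href" and value]

    def handle_data(self, data):
        self.words += data.lower().split()


def spider(url):
    """Spider: download one page (here, just a dictionary lookup)."""
    return PAGES.get(url)


def crawl_and_index(start):
    """Crawler + indexer: follow links breadth-first and build the database."""
    index = {}  # the "database": word -> set of URLs containing it
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        html = spider(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, word):
    """Results engine: dig matching URLs out of the database."""
    return sorted(index.get(word.lower(), set()))
```

Calling crawl_and_index("/a") and then search(index, "pages") walks all three toy pages and returns the two that contain the word. Real engines add ranking, politeness rules and far more storage, none of which is shown here.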

Pageoneresults takes it a step further in creating this thread to ask new questions about search engine robots for those who are not already familiar with them.


1. Do robots accept cookies?
2. What happens if my site forces a cookie?
3. Do robots execute JavaScript functions?
4. Could I be doing something technically that is stopping a robot from indexing my site?
5. How do robots interpret my page?
6. In what order do robots index my page? What is the very first step that a robot takes?

Continued discussion on WebmasterWorld - How Do Robots Work?

posted Phoenix in Search Technology at January 10, 2007 9:02 AM Comments (0)

SEO is Magical Fairy Dust and That's Final

As juicy topics go, there's always "SEO is Dead", "SEO isn't Rocket Science" (or is), and "classic" SEO is about as useful as sugarless jello. Starting off the year with a bang, Mike Grehan jumps in with a ClickZ article called SEO: Art, Science, Bollocks Or What?

Taking a cue from more recent debates that brought out Danny Sullivan and Kevin Lee, Mike refers to a book called Web Dragons: Inside the Myths of Search Engine Technology, which he just digested. He asks us to take another look at our purpose for search in the first place and how we can apply new technology to better benefit business endeavors for the long haul.

In other words, marketing. Discussion is at Cre8asiteforums.

posted cre8pc in Search Technology at January 8, 2007 10:16 PM Comments (2)

Bot Attacks: Yes It Can Happen To You

We all know about PPC fraud and that some of the fraud is caused by bots (robots) that click on the ads and drive up your bill and unwanted traffic. But it gets more serious than that. Bots are also used to steal your content, spam your site with comment spam, guestbook spam and DHTML spam, and carry out some very bad hacks.

Often, when someone writes a script to have a bot do any of the evil things they may do, they let the bot run wild. Sometimes that may take down your server.

Discovery at Search Engine Watch Forums links to a Wired article named Attack of the Bots.

The latest threat to the Net: autonomous software programs that combine forces to perpetrate mayhem, fraud, and espionage on a global scale. How one company fought the new Internet mafia – and lost.

Bots have gotten to us, they have. They got to WebmasterWorld, DigitalPoint Forums, Search Engine Watch Forums and many, many other sites.

Discovery asks, and not only in terms of PPC fraud and click fraud:

Have your concerns with bots grown over this past year?

I answer, Yes.

Forum discussion at Search Engine Watch Forums.

posted rustybrick in Spam at November 8, 2006 6:56 AM Comments (0)

Do You Care What The Search Data Says?

Or maybe I should change that to "Did you know that site search data has a story to tell?"

I found it interesting that there may be a lack of education on the part of SEO/M's, site designers, programmers and site owners (okay, all of us), on the value of website search. I'd come across a search analytics survey by Lou Rosenfeld and Rich Wiggins and decided to present it for discussion in What are the barriers to taking advantage of search analytics?

The reported verbatim answers to the survey on search analytics fascinated me, and so has the resulting conversation in the forum. Seems as though there's room for education on site search and its value, for those who offer site-wide searches. There are also tools that tell you what people are searching for that led to your site, but not much support on what you actually do with that information.

Every search phrase has a story. The survey supports the theory that there are "barriers" to "taking advantage of search analytics." Ammon Johns wrote:

"There were thousands of searches per day made, and trust me that the long tail was very visible."

It's clear, from the comments at Cre8asiteforums, that there are those who are fiddling with the data as best they can figure, and many more who don't know what to do with this pony.

posted cre8pc in Search Technology at September 27, 2006 7:01 PM Comments (0)

Search Ad Keyword Phrase Deletion Probabilities

Bill Slawski at Cre8asite Forums has a thread named Deletion Probabilities for Better Ads. In that thread he clearly explains a Yahoo! patent named System and methods for ranking the relative value of terms in a multi-term search query using deletion prediction. The abstract reads:

The likely relevance of each term of a search-engine query of two or more terms is determined by their deletion probability scores. If the deletion probability scores are significantly different, the deletion probability score can be used to return targeted ads related to the more relevant term or terms along with the search results. Deletion probability scores are determined by first gathering historical records of search queries of two or more terms in which a subsequent query was submitted by the same user after one or more of the terms had been deleted. The deletion probability score for a particular term of a search query is calculated as the ratio of the number of times that particular term was itself deleted prior to a subsequent search by the same user divided by the number of times there were subsequent search queries by the same user in which any term or terms including that given term was deleted by the same user prior to the subsequent search. Terms are not limited to individual alphabetic words.

Bill explains the logic of the patent as a method of deleting the less relevant word, if the whole phrase of the search query does not match an ad within the ad inventory.

This could be done by looking at two word searches from users, and seeing if they might delete one of the words in a follow-up search. Search engineers might be able to set something up to find such deletions, and create a "deletion probability score" for terms.
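The ratio in the abstract is easier to follow with a toy calculation. The sketch below is my own reading of it, not the patent's actual implementation: for each term, the numerator counts follow-up searches where that term was dropped, and the denominator counts follow-up searches where the original query contained the term and the user dropped something. The function name and the sample queries are made up for illustration.

```python
from collections import Counter


def deletion_scores(query_pairs):
    """Estimate a deletion probability score per term.

    query_pairs: iterable of (first_query, follow_up_query) strings
    from the same user. Only follow-ups that are the first query
    minus one or more terms are counted.
    """
    deleted = Counter()   # times the term itself was dropped
    eligible = Counter()  # times the term was in a query that lost some term
    for first, follow_up in query_pairs:
        before, after = set(first.split()), set(follow_up.split())
        # Skip pairs where the follow-up is not a shortened version of the first.
        if not after or not after < before:
            continue
        for term in before:
            eligible[term] += 1
        for term in before - after:
            deleted[term] += 1
    return {term: deleted[term] / eligible[term] for term in eligible}


# Hypothetical query log: (original query, follow-up query).
SAMPLE_LOG = [
    ("cheap hotels", "hotels"),
    ("cheap flights", "flights"),
    ("cheap hotels", "cheap"),
]
```

With SAMPLE_LOG, "cheap" is dropped in two of the three follow-ups it could have been dropped from, so it scores higher than "hotels" or "flights". Under this reading, an ad system would lean toward ads matching the terms with the lower scores.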

More details at Cre8asite Forums.

posted rustybrick in Search Technology at June 19, 2006 8:10 AM Comments (2)

SEOs, Don't Be Fooled by Personalized Search Results

There is a thread with a fun name at High Rankings Forum: False Gods. In it, a member describes how, when searching for some "ego keywords" (keywords a person wants to rank well for), he found himself ranking well. But then he noticed that Google personalized search was turned on.

The results within personalized search, no matter which search engine, are tailored to your liking. So if you want to be number one for "seo" you can be over time. Especially if you use the remove result function until your site is #1 and also if you tend to click on your pages more often than others.

A related past article that may be of interest is Search Engine Optimization is Changing So Quickly.

Forum discussion at High Rankings Forum.

posted rustybrick in Search Technology at June 14, 2006 8:28 AM Comments (0)

Google PageRank Patent Updated

The Google PageRank patent was updated the other day. The patent is titled Method for node ranking in a linked database. The abstract reads:

A method assigns importance ranks to nodes in a linked database, such as any database of documents containing citations, the world wide web or any other hypermedia database. The rank assigned to a document is calculated from the ranks of documents citing it. In addition, the rank of a document is calculated from a constant representing the probability that a browser through the database will randomly jump to the document. The method is particularly useful in enhancing the performance of search engine results for hypermedia databases, such as the world wide web, whose documents have a large variation in quality.
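For readers who want to see the two ideas in that abstract side by side (rank flowing in from citing documents, plus a constant chance of a random jump), here is a small sketch in Python. It is a textbook-style iteration, not the patent's claimed method. The damping value of 0.85 is an assumption on my part (a commonly cited choice, not something stated in the abstract), and spreading the rank of link-less pages evenly is just one common convention.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively rank nodes in a link graph.

    links: dict mapping each page to the list of pages it links to.
    damping: assumed probability of following a link instead of jumping.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts with the random-jump share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A page passes its rank evenly to the pages it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages with no outgoing links spread their rank everywhere.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: "d" links out but nothing links to it.
SAMPLE_LINKS = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
```

Running pagerank(SAMPLE_LINKS) leaves "d" with only the random-jump share, while "a", which is cited by both "c" and "d", ends up ranked above it.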

So what was changed? Bill Slawski says four things, but nothing substantial to the algorithm itself.

1. The references section was updated in this patent to include documents that are listed in the filing granted on September 28, 2004.

2. The abstract section remains the same in the new document, but the claims section was reduced in length, and appears to now cover aspects of both previous patent filings.

3. There are some minor looking changes in the “Detailed Descriptions” section between the version granted in 2001, and the one granted today.

4. The main changes appear in the summary section of the document. In the two previous documents, there were many passages that were repeated, but there were also differences. I’ve copied the areas of that section below where the three differ:

Continue reading the differences at SEO By the Sea.

Also Forum discussion at WebmasterWorld.

posted rustybrick in Google Optimization at June 8, 2006 7:32 AM Comments (0)

One Size Fits All: Optimize for Google to Optimize For Yahoo & MSN?

Over the past 6 months, ever since MSN really entered the search space, SEOs have begun talking about optimizing differently for each search engine algorithm. Back in the old days, there used to be a handful of search engines that people had to worry about. Then there was just Google, really. But now it is more diverse - we have four search algorithms to worry about. We have Google, we have Yahoo, we have MSN and we have Ask.com, which is increasing its share each day.

ProjectPHP, Cre8asite Forum administrator, started a thread named Do Search Engines Use One Algorithm For All Results? The obvious answer to me is no, they all use different engines. But what you typically find these days is that people have one optimization strategy for MSN and Yahoo and then another optimization strategy for Google. So how do you work that? You can cloak - oh no you can't! You can build different sites for different engines. You can work different pages or subdirectories of your site for different engines. Or you can pay them all off. :)

But seriously, the thread gets into the heads of what is going on today, in the trenches of SEO. If I tweak for engine X, will that hurt my rankings on engine Z? Do you believe all engines share the same algorithm? I don't. Do you?

Forum discussion at Cre8asite Forums.

posted rustybrick in Search Engine Optimization at May 19, 2006 7:37 AM Comments (0)

Yahoo! Fighting Web Spam: TrustRank & Link Spam Patent Application

To clarify before even beginning, Yahoo! does not necessarily use these techniques; they are just patent applications filed by Yahoo!

Bill Slawski posted an outstanding blog entry at SEW Blog named In Yahoo We Trust - The Link Spam Patent Application, which discusses one of Yahoo!'s papers and a patent application on fighting Web search spam.

(1) Combating Web Spam with TrustRank, which discusses how non-spam pages link to non-spam pages, as Bill describes in short.

(2) Link-based spam detection, which describes, similar to PageRank, the ability to "manually identifying reputable seed pages" and "separating reputable pages from spam pages."
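The seed-page idea can be sketched in a few lines of Python. To be clear about the assumptions: this is a simplified illustration of trust flowing outward from hand-picked seed pages along links, not the method in Yahoo!'s paper or patent application. The function name, the 0.85 damping value and the toy graph are all mine.

```python
def trust_scores(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from manually chosen seed pages along outgoing links.

    links: dict mapping each page to the list of pages it links to.
    seeds: set of pages a human reviewer marked as reputable.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets} | set(seeds)
    # Only seed pages start with (and keep receiving) baseline trust.
    base = {node: (1.0 / len(seeds) if node in seeds else 0.0) for node in nodes}
    trust = dict(base)
    for _ in range(iterations):
        new_trust = {node: (1.0 - damping) * base[node] for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            for target in targets:
                # Each page splits its trust among the pages it links to.
                new_trust[target] += damping * trust[node] / len(targets)
        trust = new_trust
    return trust


# Hypothetical graph: reputable pages link to each other, spam pages do the same.
SAMPLE_LINKS = {
    "seed": ["good"],
    "good": ["seed"],
    "spam": ["spam2"],
    "spam2": ["spam"],
}
```

With trust_scores(SAMPLE_LINKS, {"seed"}), the "good" page picks up trust because a seed links to it, while the two spam pages stay at zero since no trusted page ever links to them. That is the "non-spam pages link to non-spam pages" intuition in miniature.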

Forum discussion on these topics at Cre8asite Forums & Search Engine Watch Forums.

posted rustybrick in Search Technology at May 5, 2006 7:54 AM Comments (1)

The Science Behind Google's Algorithms by Princeton University

Philipp Lenssen covered a new book named Google's PageRank and Beyond : The Science of Search Engine Rankings by Amy N. Langville and Carl D. Meyer from Princeton University Press. Reportedly, the book is incredibly mathematical, scientific and technical.

Jon Kleinberg of Cornell University (some of you may know the name) gave the following review:

Comprehensive and engagingly written. This book should become an important resource for many audiences: applied mathematicians, search industry professionals, and anyone who wants to learn more about how search engines work.

Interested in a copy? Discuss it with others at DigitalPoint Forums.

posted rustybrick in Search Technology at May 3, 2006 7:56 AM Comments (1)

Google's Three Wireless Advertising Patent Applications

As Bill Slawski accurately notes in a thread he created at Cre8asite Forums named Google's 3 Wireless Advertising Patent Applications, branding, advertising, and subsidizing, discussion of wireless advertising patents has been the craze recently. News.com reports on it; at first they pushed out a title that said Google had won the rights to the patent, but that was quickly corrected. I much prefer to read the Cre8asite Forum thread.

Bill summarizes the three patent applications.

(1) Method and system to provide wireless access at a reduced rate:

Methods and system for providing wireless access at a reduced rate. In one embodiment, access to a WAP is provided to an end-user at a rate subsidized by a first entity. The first entity includes advertisements in an end-user view.

Bill explains that this one has more to do with wireless access, which reminds me of a Gary post named Google Awarded Patent To Make Data Move Faster to Wireless Phones and Devices.

(2) Method and system to provide advertisements based on wireless access points:

Methods and system to provide advertisements in a view of an end user accessing a wireless access point. The advertisements are related to the WAP based on a predetermined criterion.

This one basically discusses the "integration" of the wireless ads into wireless-enabled devices. I believe some geo-specific ads are discussed here as well.

(3) Method and system for dynamically modifying the appearance of browser screens on a client device:

In one embodiment, a connection of a client device to a wireless access point is identified. Further, the appearance of a screen presented on the client device is modified to reflect the brand associated with a provider of the wireless access point.

This is basically about branding the ads with the WAP partner's logo and content.

So in short you have three patent applications from Google. One about optimizing the ads across wireless protocols. The second is about the integration of the ads and the third is about branding those ads.

Forum discussion at Cre8asite Forums.

posted rustybrick in Google Optimization at March 27, 2006 3:04 PM Comments (3)

GoogleBot Goes Wireless - Google Mobile Transcoding

It looks like we have a new GoogleBot that we need to worry about. The Google spider has the User Agent "Nokia6820/2.0 (4.83) Profile/MIDP-1.0 Configuration/CLDC-1.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" and comes from a Google IP address. The bot is not incredibly new, but it is picking up speed. Many Webmasters are noticing this new creature exploring their sites. There is a whole section on the Google Remove URL page under the Remove transcoded pages anchor. This transcoding translates the page and strips out some HTML, which is upsetting to some. Mobile is important to Google; they have even added mobile queries to Google Sitemaps recently.
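If you want to see how often this bot shows up in your own logs, a simple check on the User Agent string is a reasonable first pass. The sketch below is Python and assumes you already have the User Agent strings pulled out of your log lines; the helper names are hypothetical. Keep in mind that a User Agent can be faked, so this only tells you what a visitor claims to be, not that the request really came from a Google IP address.

```python
def is_googlebot_mobile(user_agent):
    """Return True if the User Agent string claims to be Googlebot-Mobile."""
    return "googlebot-mobile" in user_agent.lower()


# The User Agent string quoted in the post above.
SAMPLE_UA = (
    "Nokia6820/2.0 (4.83) Profile/MIDP-1.0 Configuration/CLDC-1.0 "
    "(compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
)


def count_mobile_bot_hits(user_agents):
    """Count how many User Agent strings in a list claim to be Googlebot-Mobile."""
    return sum(1 for user_agent in user_agents if is_googlebot_mobile(user_agent))
```

Feeding count_mobile_bot_hits a list of User Agent strings from a day's log gives a rough idea of how fast the mobile crawler is picking up on your site.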

Forum discussion at Search Engine Roundtable Forums, WebmasterWorld and DigitalPoint Forums.

posted rustybrick in Google Optimization at March 20, 2006 8:05 AM Comments (0)

Search Advertising Patent Applications

Bill Slawski at Cre8asite forums started a thread he named Google advertising patent applications, have fun, with a list of several fairly recent patent applications on search advertising technology. Here is a quick summary of them:


  1. Adjusting ad costs using document performance or document collection performance
  2. Advertisements for devices with call functionality, such as mobile phones
  3. Facilitating the serving of ads having different treatments and/or characteristics, such as text ads and image ads
  4. Automated graphical advertisement size compatibility and link insertion
  5. System and method for rating electronic documents
  6. Results based personalization of advertisements in a search engine
  7. Rendering content-targeted ads with e-mail
  8. Selectively delivering advertisements based at least in part on trademark issues
  9. System and method for providing on-line user-assisted Web-based advertising
  10. Method and system for providing targeted graphical advertisements
  11. Generating user information for use in targeted advertising
  12. Systems and methods detecting for providing advertisements in a communications network
  13. System and method for enabling an advertisement to follow the user to additional web pages
  14. System and method for automatically targeting web-based advertisements
  15. Generating information for online advertisements from Internet data and traditional media data
  16. Promoting and/or demoting an advertisement from an advertising spot of one type to an advertising spot of another type
  17. Serving advertisements using a search of advertiser Web information
  18. Rendering advertisements with documents having one or more topics using user topic interest information
  19. Using enhanced ad features to increase competition in online advertising
  20. Method and system for dynamic textual ad distribution via email
  21. Serving content-relevant advertisements with client-side device support
  22. Serving advertisements based on content
  23. Methods and apparatus for serving relevant advertisements
  24. Method and system for providing advertising listing variance in distribution feeds over the internet to maximize revenue to the advertising distributor
  25. Method and system for providing filtered and/or masked advertisements over the internet
  26. Method and system for providing advertising through content specific nodes over the internet

Um, wow, ummm, now that is one huge list. Forum discussion at Cre8asite Forums.

posted rustybrick in Search Technology at March 17, 2006 8:44 AM Comments (0)

The Ultimate Forum Thread for Search Papers and Patents

I hope the member who posted the thread titled List of papers and patents?, Got one, guv? knew what he was getting himself into. All you need is for Bill to see that and serve up one of the most comprehensive lists of papers in a forum thread ever.

There is not much I can say, I am still a bit in shock from the list.

Forum discussion at Cre8asite Forums.

posted rustybrick in Search Technology at February 16, 2006 7:53 AM Comments (1)

Search Engines Find a Way to Gauge and Confirm Trust

It started with an interview I found on "Trustmarks", in which Paul Walsh, the co-founder and CEO of Segala M Test, was talking about a way to enhance personalized search by including a trust rating.

Perhaps you've heard of ICRA (Internet Content Rating Association) descriptors for child protection. I've had this code on one of my sites for years, since I wanted parents to trust it. (And being one, I care about that sort of thing.) Walsh was interviewed about the Segala trustmark scheme and their working with the World Wide Web Consortium (W3C). Segala is a founding sponsor of the Mobile Web Initiative (MWI) responsible for creating best practices and guidelines for the future Web on small screens such as PDAs and mobile phones. Paul Walsh is also a committee member of the Web Accessibility Initiative.

I was very curious about how, or whether, search engines would implement "Trustmarks". We started a thread at Cre8asiteforums about it, featuring the interview, and several members were most interested. Some wondered how this trust is tracked. What would stop anyone from being registered, getting a trustmark and then changing their content?

Paul Walsh dropped by when the thread first began to tell us more.

Continue reading "Search Engines Find a Way to Gauge and Confirm Trust"

posted cre8pc in Search Technology at January 27, 2006 3:42 PM Comments (1)

The Invisible Spider: Covert Crawler

A thread over at Cre8asite forums named New kind of spider is in town links to a Wired article named Covert Crawler Descends on Web. In short, this article describes a new kind of spider designed to crawl the Web in as human-like a way as possible.

How Does it work?

The program comes from different internet addresses, simulates different browsers and throttles itself to human-like speeds... Hoffman's program downloads everything that comes with a page -- images, JavaScript and components like ActiveX and Flash -- instead of just hitting the page itself like traditional spiders do. It also simulates a full web browser, keeping a cache and requesting only new material... To select which links to click on, Hoffman has settled on a solution somewhere between a masterful AI and completely random selection. "In some ways it's a very simplified Turing test -- you can assign the different threads a personality. This crawler, you're the slow reader, you read the entire page." Another thread may spend less time on a page before it starts clicking on different links. "Each individual crawler has its own browser habits," he added.

Barry Welford calls this spider "somewhat scary", and I agree. Ron Carnell has it right: "any robot that doesn't ask for and then follow robots.txt is, by definition, unethical." Ron also gives you a technique you can use to track and then block this type of bot.
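
Ron's exact technique is in the thread. One common approach along these lines (an assumption here, not necessarily what Ron describes) is a trap URL: list a path under Disallow in robots.txt, link to it where humans won't click, and flag any client that requests it anyway. A minimal Python sketch, where the path /bot-trap/ and the log entries are made up for illustration:

```python
# Hypothetical trap path: it is disallowed in robots.txt, so polite robots never fetch it.
TRAP_PREFIX = "/bot-trap/"

def find_rule_breakers(requests):
    """Return the set of client IPs that requested the trap path.

    `requests` is an iterable of (ip, path) pairs, e.g. parsed from an access log.
    """
    return {ip for ip, path in requests if path.startswith(TRAP_PREFIX)}

def is_blocked(ip, blocklist):
    """True if this client has already tripped the trap."""
    return ip in blocklist

# Made-up sample log entries, for illustration only.
sample_log = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.2", "/bot-trap/page1.html"),
    ("10.0.0.1", "/about.html"),
    ("10.0.0.3", "/bot-trap/"),
]
blocklist = find_rule_breakers(sample_log)
```

Since the crawler described in the article comes from different internet addresses, an IP blocklist like this would only catch each address after its first trap hit.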

Forum discussion at Cre8asite Forums.

posted rustybrick in Search Technology at January 17, 2006 8:52 AM Comments (0)

Researching Search Engine Results and How People Use Them for Research

I'm meddling in Bill Slawski's territory here, in that he's better known for presenting and analyzing papers on search engine technology. However, I caught this one and since it uses usability testing scenarios in the research, I gave it a shot.

The paper is Using meaningful and stable categories to support exploratory web search: Two formative studies by Bill Kules and Ben Shneiderman, of the Department of Computer Science, Human-Computer Interaction Laboratory and Institute for Advanced Computer Studies, University of Maryland.

The purpose of the study is to better understand how people use search engines to research topics - specifically, how categorization of search results applies to the end user experience.

"Categorizing web search results into comprehensible visual displays using meaningful and stable classifications can support user exploration, understanding, and discovery. We report on two formative studies in the domain of U.S. government web search that investigated how searchers use categorized overviews of search results for complex, exploratory search tasks."

They ran test subjects through a variety of tasks. Here is one example.

"Scenario 2 (Breast cancer) - You are a 30-year old journalist writing an article on breast cancer and what the federal government is doing about it. You are exploring the topic, starting by looking on the Web to find out what kind of information is available. You have just entered the search terms "breast cancer".

Continue reading "Researching Search Engine Results and How People Use Them for Research"

posted cre8pc in Search Technology at January 3, 2006 1:44 PM Comments (0)

Craigslist Blocks Most Spiders: Millions of Pages Delisted

A thread started at our forums named Craigslist Delists Millions of Pages from Search Engine Indexes uncovers the new robots.txt file in place over at Craigslist. It basically reads;

##############################
# Exclude robots from these

User-agent: YahooFeedSeeker
Disallow: /forums
Disallow: /res/
Disallow: /post
Disallow: /email.friend
Disallow: /?flagCode
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj

User-agent: *
Disallow: /cgi-bin
Disallow: /cgi-secure
Disallow: /forums
Disallow: /search
Disallow: /res/
Disallow: /post
Disallow: /email.friend
Disallow: /?flagCode
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj


#####################################

They supposedly had millions of pages (3.6 million, to be exact) indexed at Google and millions more at the other search engines. Now? 211,000 at Google, 280,000 at Yahoo and 4,695 at MSN.
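
To see how a crawler would read rules like these, here is a small Python sketch using the standard library's urllib.robotparser. It uses an abbreviated copy of the file quoted above, not the full listing; the agent name ExampleBot and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the rules quoted above (not the full file).
ROBOTS_TXT = """\
User-agent: YahooFeedSeeker
Disallow: /forums
Disallow: /sss

User-agent: *
Disallow: /cgi-bin
Disallow: /forums
Disallow: /search
Disallow: /sss
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """Ask the parsed rules whether `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)

# "ExampleBot" is a hypothetical crawler that falls under the "*" group.
generic_listing = allowed("ExampleBot", "/sss/12345.html")
generic_other = allowed("ExampleBot", "/about/terms.html")
# YahooFeedSeeker has its own group, which does not list /search.
feedseeker_search = allowed("YahooFeedSeeker", "/search/results")
feedseeker_forums = allowed("YahooFeedSeeker", "/forums")
```

Any path that starts with a Disallow prefix is off limits to the matching agent, which is how a short list of prefixes can remove a very large number of pages from the indexes.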

Forum discussion at Search Engine Roundtable Forums.

posted rustybrick in Search Technology at January 3, 2006 8:10 AM Comments (0)

Google Analytics (ex-Urchin) Delivers Web Analytics for FREE

Google has now re-branded Urchin to Google Analytics presenting users with better ways to “understand and influence visitor behavior and generate a higher ROI on marketing initiatives”. Yes folks! It’s offering a free hosted web analytics service, in hopes that advertisers, publishers and website owners will spend time understanding how people find their websites, navigate through them and convert on the goals of the site. With the free service, Google hopes it helps people spend money on their search marketing campaigns rather than on measurement. This is going to have a huge impact on both the search marketing and the web analytics industries. Draw your own conclusions.

But how much is really free? Google Analytics will allow you to track up to 5 million pageviews per month, no questions asked, no fees charged. If you have a BIG MONSTER website, all they request is that you have at least one active AdWords account with an active campaign and spend $1 if you want; that’s all it takes. No more pageview caps. I’m sure they hope you spend much more than that when you see all the tracking benefits.

What’s more, Google Analytics now allows integration with AdWords to better monitor “ROI metrics automatically without having to import cost data or tag keywords”, as well as tracking all of your other internet marketing initiatives. When you subscribe to it, you will see it as a new tab under your AdWords account. It now has executive, marketer, and webmaster dashboards for viewing quick summaries of “traffic, e-commerce, and conversion trends without hunting through reports.” Here is what else it offers:


  • Reporting interface accessible directly from the google.com/analytics website if you don’t have an AdWords account

  • Advanced visitor segmentation with over 80 web analytics reports

  • Ability to track up to 50 websites within your account

  • Site overlay

  • Funnel visualization

  • GeoTargeting with a cool map that shows where your traffic comes from
  • It’s available in 16 languages: Chinese (Simplified), Chinese (Traditional), Danish, Dutch, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Portuguese, Russian, Spanish, Swedish and English.

  • And much more…


For those worried about privacy, this is what they say: “Google takes the trust people place in us very seriously, and we are committed to safeguarding the privacy of your data. We understand that web analytics data is sensitive, so we accord it the ironclad protection it deserves. Google Analytics is subject to the same industry leading privacy policy as all Google services: http://www.google.com/privacypolicy.html”

On a personal note, I’m also very excited with the steps Google is making because my consulting firm, iHispanic Marketing Group, is proud to announce that Google Analytics has chosen us as one among other Client Service and Support Consultants to service the global Hispanic market. With this strategic alliance we are committed to delivering professional services for training, advanced support, and expert web analytics consulting to executives, marketing managers and webmasters in both Spanish and English. The loyalty we’ve had to Urchin and to our clients has brought great rewards. Google Analytics will be a fun ride moving forward as we continue building leadership with the Hispanic market for search engine marketing and internet strategy.

For discussion on this topic, you’re welcome to share your thoughts in the SearchEngineWatch Forum’s thread: Urchin Now Google Analytics, Now Free.

posted nacho in Tracking & Conversion Measurements at November 13, 2005 11:16 PM Comments (3)

Large Listing of Search Patent Applications, Not from Google

Gary Price of Search Engine Watch took a look today at some non-Google patent applications from Yahoo, Microsoft and others. They include HP, Microsoft, Yahoo!, and Overture (Yahoo-owned).

As posted by Gary;

Title: Method and system for identifying image relatedness using link and page layout analysis
Assignee: Microsoft

Title: Method and system for classifying display pages using summaries
Assignee: Microsoft

Title: Method and apparatus for performing a search
Assignee: Yahoo

Title: Method and system for ranking documents of a search result to improve diversity and information richness
Assignee: Microsoft

Title: Contextual flyout for search results
Assignee: IBM

Title: Method and apparatus for providing information
Assignee: Fujitsu

Title: Method and apparatus for identifying related searches in a database search system
Assignee: Overture/Yahoo

Title: Verifying relevance between keywords and Web site contents
Assignee: Microsoft

Title: Systems and methods that rank search results
Assignee: Microsoft

Title: Search systems and methods with integration of user annotations
Assignee: Yahoo

Title: Integration of instant messenging with Internet searching
Assignee: Yahoo

Title: Search system using user behavior data
Assignee: Microsoft

Forum discussion at Search Engine Watch Forums.

posted rustybrick in Search Technology at November 10, 2005 11:06 AM Comments (1)

Link Spam Detection Research Paper

Last night Gary blogged on A New Report on Estimating Link Spam. Gary describes it as a "21 page (pdf) technical research paper from the Stanford InfoLab that takes a look at link spam." The paper was written by two folks at Yahoo and two at Stanford: Zoltan Gyongyi (Stanford), Pavel Berkhin (Yahoo), Hector Garcia-Molina (Stanford), Jan Pedersen (Yahoo).

Read Link Spam Detection Based on Mass Estimation if you dare. :)

Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page's ranking. We discuss how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.
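
As a rough illustration of the idea in this abstract, one way to read "spam mass" is as the part of a page's PageRank that does not come from a trusted set of pages. The Python sketch below is a simplified reading, not the paper's exact formulation: it computes PageRank twice on a made-up seven-page graph, once with a uniform random jump and once jumping only to a hypothetical trusted core, and treats the relative difference as the estimated spam mass. The graph, the damping value of 0.85 and all names are assumptions for illustration.

```python
def pagerank(links, jump, damping=0.85, iterations=50):
    """Linear PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    jump:  dict mapping each page to its random-jump weight.
    Pages without outlinks simply leak their score (no redistribution).
    """
    scores = dict(jump)
    for _ in range(iterations):
        new = {page: (1 - damping) * jump[page] for page in links}
        for page, targets in links.items():
            for target in targets:
                new[target] += damping * scores[page] / len(targets)
        scores = new
    return scores

# Made-up graph: two trusted pages link to good_page, three spam pages link to target.
links = {
    "trusted1": ["good_page"],
    "trusted2": ["good_page"],
    "good_page": [],
    "spam1": ["target"],
    "spam2": ["target"],
    "spam3": ["target"],
    "target": [],
}
n = len(links)
trusted_core = {"trusted1", "trusted2"}  # hypothetical hand-picked good pages

uniform_jump = {page: 1 / n for page in links}
core_jump = {page: (1 / n if page in trusted_core else 0.0) for page in links}

p = pagerank(links, uniform_jump)    # ordinary PageRank
p_core = pagerank(links, core_jump)  # PageRank that flows only from the trusted core

# Share of each page's PageRank that is not explained by the trusted core.
relative_mass = {page: (p[page] - p_core[page]) / p[page] for page in links}
```

In this toy graph the page boosted only by spam pages ends up with nearly all of its score unexplained by the trusted core, while the page linked from trusted pages does not.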

Forum discussion soon to be at this thread at Search Engine Watch Forums.

posted rustybrick in Search Technology at November 9, 2005 8:06 AM Comments (0)

Very Personalized Search Without Knowing It

Again, Loren Baker posts a thread at Search Engine Watch Forums named Google to Manipulate Organic Rankings with User Profile. In that thread, he summarizes a patent application named Personalization of placed content ordering in search results.

A system and method for using a user profile to order placed content in search results returned by a search engine. The user profile is based on search queries submitted by a user, the user's specific interaction with the documents identified by the search engine and personal information provided by the user. Placed content is ranked by a score based at least in part on a similarity of a particular placed content to the user's profile. User profiles can be created and/or stored on the client side or server side of a client-server network environment.

Loren explains that this is different than a Google AdWords patent, because this affects the core organic results. He expands: "Such profiles are created by Google and gathered from previous queries, web navigation behavior via tracked links and possibly sites visited which serve Google ads, computers with Google Applications installed such as Desktop Search, Google Wi-fi Connection or Sidebar, and personal information which Google identifies which may be “implicitly or explicitly provided by the user.”" Loren wrote a Search Engine Journal article on it named Google Patent : Organic Results Ranked by User Profiling with more information.

Forum discussion at Search Engine Watch Forums.

posted rustybrick in Google News & Press at November 3, 2005 8:24 AM Comments (0)

On-topic analysis and C-Index For You Smart People

Regardless of the fact that Randfish says he made a mistake at SEOChat, I can vouch for the man's integrity and ability to understand the nearly impossible.

At the NYC SES Conference, I sat in the front row, between Rand to my left and Bill Slawski to my right, pretending to understand Dr. Garcia's (aka "Orion's") PowerPoint presentation on search engine algorithm technology. My eyes glazed over exactly the way they used to in Math class.

Meanwhile, Bill and Rand were muttering out loud various things that convinced me that I was sitting between two geniuses. They understood the diagrams on the wall.

Rand, though admitting he may have said something erroneous, still understands C-indexing, and in his SEOmoz post today, points to some resources on term vectors, on-topic analysis and the mysterious and anal world of keyword density. Go there now, and read it all in the privacy of your own home or office cube, where nobody is watching to see if you really know what the heck all this stuff means.

I won't tell.

posted cre8pc in Search Technology at October 17, 2005 3:40 PM Comments (1)

AdWords Trademark Verification Technology Patent

Phew, that was a hard title to come up with and I am still unsure if it properly represents the latest patent application that has been filed by Google named Selectively delivering advertisements based at least in part on trademark issues. Based on the abstract;

A system and method for selectively delivering legal information communications for documents (e.g., advertisements). An input is received, wherein the input is a document delivery-triggering event operative to cause a document to be delivered to a user. A location associated with the input is identified. Based at least in part on the location, it is determined whether to provide a legal information communication. A document is delivered based at least in part on the input, wherein the document is delivered with a legal information communication if the location is determined to be in a legal information communication jurisdiction.

It seems like they are going to let trademark holders electronically submit proof of trademark infringement. The system will automatically do its best to verify the documents submitted and process ads based on that data. I am not about to read the whole patent application now, but it can just be added to our light weekend reading.

Bill at Cre8asite forum has the thread on this, which he named Trademarks and Google Ads (a much simpler title, but it says a lot). He explains that the patent application provides "a technical framework for considering trademark issues when serving advertisements." He adds;

One aspect of it would serve a legal disclaimer in some jurisdictions to limit consumer confusion. In other jurisdictions, it may not serve an advertisement at all based upon the laws of that region. The document provides thoughtful technical framework for handling trademarks in ads.

posted rustybrick in Google AdWords at October 7, 2005 9:03 AM Comments (0)

Scary Google Patent Application & Some Gmail Patent Apps

Msgraph, known for his postings of complex patent applications, posted two new threads at Search Engine Watch Forums.

The first I want to share with you is one that might make you cringe. The thread is titled, Patent App For Behavioral Monitoring Desktop Application. Msgraph explains that this is really scary stuff;

Imagine all of your actions being constantly monitored in order to build personalized search queries. Those last words you typed in a Word document or IM window. The e-mail you just sent. What words your cursor is next to. The text you copied to the clipboard. All of it constantly monitored and processed, in real-time, locally and/or using Google's search engine in order to build search results for you in case you need them at a moment's notice.

Little brother. :)

The next thread msgraph started is on Gmail patent application requests; he titled the thread GMail Patent Application Bundle. It is a bundle of patent applications, because he linked to six different applications in that one thread. Here they are;

There is your light weekend reading.

posted rustybrick in Other Google Topics at October 7, 2005 8:55 AM Comments (0)

Five New Search Patent Threads at Cre8asite Forums

Cre8asite Administrator Bill Slawski does it again by posting five (yes, you heard me right) different threads discussing two Google patents, two Yahoo patents and one MSN patent. Let's start with Google.

Google Patents:
- Variable personalization of search results
Bill explains;

This invention would enable a searcher to fill out a profile, perform a normal search, and then use a slider button to indicate how much his or her personal information from the profile should be used to modify (rerank) that search based upon the personalization information that they have entered into the profile, by sliding the button partially, or all the way to a full influence on the results.

- Google Help with Advertising Creatives
Bill explains;

This invention is aimed at novices to advertising, and to individuals or small business owners that may not have a web presence, or may need assistance in coming up with creatives, keywords, etc... The description of their "Advertising Generation Engine" is interesting, including the generation of a creative. I'm wondering where they will be getting their "eye catching images."
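
The slider in the variable personalization application above can be pictured as a weight between two scores. The thread does not describe the actual scoring, so the Python sketch below simply assumes a linear blend of a base relevance score and a profile match score; the function name, the scores and the URLs are all hypothetical.

```python
def rerank(results, slider):
    """Reorder results by blending base relevance with profile match.

    results: list of (url, base_score, profile_score) tuples, scores in [0, 1].
    slider:  0.0 means no personalization, 1.0 means full personalization.
    """
    if not 0.0 <= slider <= 1.0:
        raise ValueError("slider must be between 0 and 1")

    def blended(item):
        _, base, profile = item
        return (1 - slider) * base + slider * profile

    return [url for url, _, _ in sorted(results, key=blended, reverse=True)]

# Made-up results: (url, base relevance, match with the searcher's profile).
results = [
    ("example.com/general", 0.9, 0.1),
    ("example.com/hobby", 0.6, 0.8),
    ("example.com/mixed", 0.7, 0.5),
]

no_personalization = rerank(results, 0.0)
full_personalization = rerank(results, 1.0)
```

With the slider at zero the normal ranking is untouched; sliding it all the way over lets the profile decide the order.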

Yahoo Patents:
- Yahoo! Color Graphing and Personalization
Bill didn't have enough time to summarize, but the patent application's abstract reads;

In a search processing system, identifying input authority weights for a plurality of pages, wherein an input authority weight represents a user's weight of a page in terms of interest; distributing a page's input authority weight over one or more pages that are linked in a graph to the page; and using a resulting authority weight for a page in effecting a search result list. The search result list might comprise one or more of reordering search hits and highlighting search hits.

- Inverse searches and User Annotations
Bill explains;

Imagine being about to enter a URL, and find out information about the site, collected over time, such as which other sites point to it, what other people feel about it, and more. A couple of new patent applications from Yahoo! cover these types of topics.

This is really two patents in this one thread.

MSN Patent:
- Assigning textual ads based on article history
Bill explains;

There's something to this new patent application from Microsoft that reminds me of the movie Minority Report, where the protagonist goes through a shopping area, and his past purchases inform the billboards and advertisements of his potential future interests... Collecting keywords from pages that you've visited in the past, to decide what ads to show you now? It's an attempt, somewhat, to get away from contextual based ads that may not be appropriate...

Makes for some nice light reading... :)

posted rustybrick in Search Technology at September 29, 2005 9:16 AM Comments (0)

Geocoding Patent Awarded to Google

For those of you who stay on top of those Google patents, it's important for you to know that they were just awarded a patent named Address Geocoding. Gary Price explains;

It's quite easy to envision how this technology could be used to identify and map info based on what's listed on a web page or other document. It also might be used to help identify local search results, personalized results (based on a users address) and when and where a paid ad would be visible on a results page based on the location of a searcher.

Patent member msgraph posted a thread about the topic at Search Engine Watch Forums a couple hours before Gary's post. Normally, Gary beats everyone to the patent announcements; I guess he was holding back on this one. :)

posted rustybrick in Google Optimization at August 24, 2005 8:47 AM Comments (0)

The Average Google Searcher Spends 7 Seconds...

The average Google searcher spends less than seven seconds looking at a search results page before they make a decision to click. Your challenge... get them to notice your ad or organic listing.
I got a call this morning from Marketing Sherpa informing me that the complete eye tracking study has finally been released by Enquiro Search Solutions, including 64 screenshots and 37 charts of data collected from an ongoing study on the behavior of the eye on search results.

At SES New York, I wrote about Eye Tracking and Search Combine to give us the Golden Triangle, that magic "F"-shaped area that the eye scans on a search engine results page. I was pretty interested in the study and found it could help answer, and generate, a good number of questions. Based on the complete study, they have defined five specific patterns people tend to use depending on where they are in the sales/educational cycle: the Quick Click, Linear Scan, Golden Triangle Scan, Deliberate Scan, and Pick Up Search.

Other questions we have all wondered about at one time or another:


Does adding prices to your copy make a difference in the way people look at your ads and organic listings?

How does the use of bold affect the way people read your copy?

Should the keyword always be included in the title? (Worth noting if you're using ad groups for lots of niche keywords.)

The report for purchase can be found here on the Marketing Sherpa site. You will probably want to visit that one first if interested. I could not find any forum threads on this topic.

posted Phoenix in Search Technology at August 4, 2005 11:17 AM Comments (2)

What Does a Search Engine Spider Look Like?

So what does Googlebot look like? I have heard crazy stories that it resembles something like a scary sea monster; others have said you can't see it unless you have special decoder glasses and a fast internet connection. For the most part I believe that's a bunch of nonsense, but some members at Cre8asite Forums are having some fun trying to get to the truth of what a search spider looks like. They take a creative look at what we know about the territory and existence of the search engine spider creature and dive into what others are saying about it.

Ammon Johns quotes expertly that they are "powerful elemental entities that can be harnessed in a small container, and can cause insanity in those they torment." Yep, I would say that's about right in most cases. I have known a few SEOs and webmasters to go crazy in the pursuit of a spider from time to time.

Here is a reported picture of Googlebot as drawn by one of Google's own employees. Seems more like art to me.

For more discussion about what a search engine spider looks like, navigate over to Cre8asite Forums.

posted Phoenix in Search Technology at June 13, 2005 2:52 PM Comments (3)

Relevance Defined by the Scientists

Most of you know about my little project, The Search Engine Relevancy Challenge. Outside of a user's perceived relevance of a search response, how do the PhDs and scientists define relevancy? I particularly like the way Orion clearly described three ways to measure relevancy in a thread started by another researcher named nanocontext, titled The relevance of "relevance". Orion said that "relevancy has a lot to do with perception" and then he pulls out three types of "perception".

1. Which content is relevant according to user's perception?
2. Which content is relevant according to scoring functions used by a machine (IR system or search engine)?
3. Which list of content (documents) scored and already prequalified as relevant by a search engine algorithm are actually relevant according to user's perception and to the query that has been used?

Orion says that we are trying to measure number three with RustySearch (by the way, please make this your default browser for the next two weeks to help the study). Nanocontext believes that "#3 is the most critical question, because thats where the money is." In addition, I am told that I should refresh my memory on the topic of "precision versus recall", which I promise to do and then write a brief entry on it here. This thread, of course, piqued my interest.

posted rustybrick in Search Technology at May 11, 2005 8:46 AM Comments (0)

Stuff I've Seen (SIS) in the Future of Search

You will be the search agent crawling the web for the search engine. You will be able to index what you see when you see it, from email to video to webpage. Search will go beyond the query box and into your personal space; it will be shaped along with you, and not against you. Search engines are looking hard at the future of search this year and deciding to give you a bigger part in how you search the web. In a very interesting thread at SEW, Orion posts on a presentation from MSN's research specialist Susan Dumais that explores the new technology Stuff I've Seen (SIS) and its implications for personal information management and, as they put it, "Helping finders become keepers." I am particularly excited about this new technology, more so than I would have imagined, as I see it really changing the way we use the web in the future.

Orion mentions that while the idea is not new, the technology to make it happen was always a barrier, but today that is not the case. Nacho theorizes that what we are seeing in science, technology, and marketing today will one day make our industry more important than television. Another member, Xan, comments that the researcher from MSN never believed in the idea of a semantic web and that such ideas were nonsense and impractical in the way for which we "want" to use search, not the way we will be told to use search in the future. I explore the angle of search engines being able to index anything and everything, with the innate ability to index things as we see them. Personally I find this an intrusion, and would not want a search engine to index everything I see. Additionally, I question how this new technology could have an impact on us as marketers. Since personalization and integration of this will go in many directions, will there be any common reference points we can share with others?

Ms. Dumais's presentation goes on to talk about search today, and how there are many information silos that can be indexed in order to grab documents. However, doing this can be slow. She talks about how you might have the option of opening up your massive digital libraries to the world, or keeping them just for yourself so you can search them more easily. One of the barriers mentioned is that as information libraries grow, it will become harder and harder to locate documents within them. Her presentation provides examples of SIS in use, and the current testing that is underway with a group of 3000 people. Not surprisingly, 76% who use SIS technology are using it to find email, with about 14% looking for web pages, and a good majority of people looking for documents from over a month ago. There are some interesting implications for this technology, such as integration with TV programming and the ability to index things as they are watched. Imagine going back and searching through a TV documentary you watched over the last year. Or going as far as a not too distant reality for marketers to say "target only women age 25-35 on a query for "chocolates" during February days." Pretty cool.

Continued discussion on Stuff I've Seen at SEW

posted Phoenix in Search Technology at April 27, 2005 2:09 PM Comments (2)

Gary Price's List of Search Engine Patents

Gary Price, from Resource Shelf and the SEW Blog, posted a nice updated collection of both Yahoo's patents circa 2003 and MS search patents.

Some more weekend reading for you folks.

posted rustybrick in Search Technology at April 22, 2005 8:12 AM Comments (0)

Weekend Reading: Rankings, Link Farms, Personalization, and PageRank Collusion

If you're the type that does a lot of offline reading on the weekends, make sure to check out Gary Price's entry at SEW Blog. He has linked to many fun and exciting (technical and dry) research papers related to Rankings, Link Farms, Personalization, and PageRank Collusion. What is even better is that he plans on doing this more often; this is "installment number one."

posted rustybrick in Search Technology at April 8, 2005 1:47 PM Comments (0)

Bookmark Data for Ranking Purposes

The ultimate vote for a page is if someone bookmarks that page for later use. Well, maybe it is not the ultimate vote, since I have tons of orphaned bookmarks that I never visit. But if search engines can capture one's bookmarks with date stamps including frequency of use and date added, that can be a valuable measurement used in determining page importance.

In a thread at Search Engine Watch forums (and, I am sure, at many other forums; it has been a busy week), members discuss this as a possibility. In fact, the recent patent released by Google discusses more than sandboxing concepts; it discusses monitoring "data maintained or generated by a user, such as "bookmarks," "favorites,"" and so on. Nacho pulled an excerpt for that portion, in the thread;

"According to an implementation consistent with the principles of the invention, user maintained or generated data may be used to generate (or alter) a score associated with a document. For example, search engine 125 may monitor data maintained or generated by a user, such as "bookmarks," "favorites," or other types of data that may provide some indication of documents favored by, or of interest to, the user. Search engine 125 may obtain this data either directly (e.g., via a browser assistant) or indirectly (e.g., via a browser). Search engine 125 may then analyze over time a number of bookmarks/favorites to which a document is associated to determine the importance of the document.

[0115] Search engine 125 may also analyze upward and downward trends to add or remove the document (or more specifically, a path to the document) from the bookmarks/favorites lists, the rate at which the document is added to or removed from the bookmarks/favorites lists, and/or whether the document is added to, deleted from, or accessed through the bookmarks/favorites lists. If a number of users are adding a particular document to their bookmarks/favorites lists or often accessing the document through such lists over time, this may be considered an indication that the document is relatively important. On the other hand, if a number of users are decreasingly accessing a document indicated in their bookmarks/favorites list or are increasingly deleting/replacing the path to such document from their lists, this may be taken as an indication that the document is outdated, unpopular, etc. Search engine 125 may then score the documents accordingly."
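
To make the "upward and downward trends" in this excerpt concrete, here is a toy Python sketch. The patent text does not say how such a trend would be computed, so this simply assumes a log of bookmark add and remove events per time period and sums the net change; the function names and sample data are hypothetical.

```python
def bookmark_trend(events):
    """Net change in bookmark count per period.

    events: list of (period, action) pairs, where action is "add" or "remove".
    Returns a dict mapping period -> adds minus removes.
    """
    net = {}
    for period, action in events:
        if action not in ("add", "remove"):
            raise ValueError(f"unknown action: {action}")
        delta = 1 if action == "add" else -1
        net[period] = net.get(period, 0) + delta
    return net

def trend_signal(events):
    """Positive if the document is gaining bookmarks overall, negative if losing them."""
    return sum(bookmark_trend(events).values())

# Made-up event logs for two documents.
rising = [(1, "add"), (1, "add"), (2, "add"), (2, "add"), (2, "add"), (3, "add"), (3, "remove")]
fading = [(1, "add"), (2, "remove"), (2, "remove"), (3, "remove")]
```

A document like the first one would look "relatively important" under the excerpt's reasoning, while the second would look "outdated, unpopular, etc."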

So the thread asks, will you soon see more Web sites asking you to "bookmark this page"? Or better yet, will they run scripts that automatically bookmark the page for you, without your knowledge?

posted rustybrick in Google Optimization at April 5, 2005 3:59 PM Comments (1)

Become.com's AIR Out-Ranks Hilltop

Become.com, a relatively new player in the shopping search engine game, has been getting a lot of press over the past few months. One such topic is their vertical search technology named AIR. Today, Jason Dowdell interviewed Become's CTO Yeogirl Yum, asking some targeted questions on the AIR (Affinity Index Ranking) technology. One thing that stood out in the interview was:

AIR is significantly more advanced than that hilltop algorithm. The hilltop algorithm (as described at www.cs.toronto.edu/~georgem/hilltop/) considers only links from a limited number of "expert" sources when identifying target web pages. According to the hilltop paper, "the targets are then ranked according to the number and relevance of non-affiliated experts that point to them. When such a pool of experts is not available, Hilltop provides no results. Thus, Hilltop is tuned for result accuracy and not query coverage."

AIR, on the other hand, evaluates connectivity between all pages in a given topic. Rather than focusing on "top of the hill" sites, AIR understands the overall network of sites within a topical area. Both inlinks and outlinks are evaluated to understand the level of interconnection among the sites. Advanced mathematics and concepts from Applied Physics and Engineering Dynamics are used to calculate specific scores.

Well, I am no scientist, but I do understand that it is very difficult to evaluate "connectivity between all pages in a given topic" in real time. Limiting it to structured data makes it easier, and limiting it further to a specific vertical makes it easier still.
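To make the Hilltop description quoted above a bit more concrete, here is a rough Python sketch of ranking targets by "the number and relevance of non-affiliated experts that point to them". The function name, the tuple layout, the affiliation labels and the `min_experts` cutoff are all my own assumptions for illustration; this is not the algorithm from the paper, only its quoted idea.

```python
from collections import defaultdict

def hilltop_rank(expert_links, min_experts=2):
    """Rank targets by the experts pointing to them.

    expert_links: iterable of (expert_id, affiliation, relevance, target).
    The affiliation label and min_experts threshold are assumed inputs.
    """
    # Keep only the strongest expert per (target, affiliation), so that
    # affiliated experts cannot vote more than once for the same target.
    best = defaultdict(dict)
    for expert_id, affiliation, relevance, target in expert_links:
        previous = best[target].get(affiliation, 0.0)
        best[target][affiliation] = max(previous, relevance)

    # Targets without enough non-affiliated experts get no score at all,
    # which mirrors "Hilltop provides no results" for a missing expert pool.
    scores = {
        target: sum(votes.values())
        for target, votes in best.items()
        if len(votes) >= min_experts
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

links = [
    ("e1", "a.com", 0.9, "t1"),
    ("e2", "a.com", 0.8, "t1"),  # affiliated with e1, so only one of them counts
    ("e3", "b.org", 0.5, "t1"),
    ("e4", "c.net", 0.7, "t2"),  # a single expert is not enough for t2
]
ranking = hilltop_rank(links)
```

Here only t1 is returned, because t2 has a single expert behind it. That is the accuracy-over-coverage trade-off the quote talks about.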

posted rustybrick in Search Technology at March 21, 2005 12:55 PM Comments (0)

Does Google Look at Terms within the Context of a Document?

An interesting thread sprang up at Search Engine Watch forums named Does google attempt to put terms in context? In this thread a member asks "How intelligent is google when putting terms in the correct context?" The example the member uses is "Wristwatches". He has a page on Wristwatches and all the areas of the page are focused on that term, Wristwatches. If he had other verbiage that relates to wristwatches, such as "Citizen", a brand of wristwatches, would Google recognize the difference? Would citizen be understood by Google as a brand of wristwatches or would it be understood by Google as a legal resident of a specific country?

Wow, that makes for a great question. And it led me back to the February update, where the topic of LSI or Latent Semantic Indexing was the latest craze. In another entry related to this, Daron Babin (aka SEGuru) was quoted explaining it well in layman's terms: "He recommends writing a page of content and pulling out the keywords, then give it to someone and ask them to figure out what they keyword is. He said its about the other words on the page, its that important. If the keyword is "apple" is the page about computers or fruit?"

In the thread, I posted a reply referencing Google Suggest. By way of example, I said "Type in "citizen", as you start typing it in, it will suggest popular searches. You will notice that the 3rd suggestion is, in fact, citizen watches." It makes you wonder, if they can use this on a large scale, in real time, and return results within milliseconds...

posted rustybrick in Google Optimization at March 16, 2005 12:06 PM Comments (2)

Orion, Dr. Garcia, Warns SEOs on Overlapping Patterns

The resident scientist to the search engine optimization community, Orion (Dr. E. Garcia), wrote a new short article named Overlapping Patterns: EF-Ratios, Separators, Patterns and Pitfalls; "Overlapping patterns within overlapping patterns! It takes few seconds to realize that query mode implementations can exhibit a fractal component."

In this article, he reviews the concepts of his creation, "EF Ratios", discusses Separators (spaces, delimiters, or stopwords) & EXACT queries (""), and Patterns & Overlapping Regions. But most importantly, he explains some of the "pitfalls SEMs/ SEOs should avoid" when using keyword research tools. I'll quote the last paragraph here:

Ultimately, metrics based on search results are not just affected by tokenization and similar procedures taking place at the level of the individual IR architectures. They can be the result of relevancy scores assigned by the queried system. Combining these metrics with metrics from other search engines that account, let say, for user's query behaviors (i.e., search volume) to come up with a new metric is a highly questionable approach. SEOs, SEMs and keyword research firms should stay away from such practices.

posted rustybrick in Search Technology at March 8, 2005 11:32 AM Comments (1)

Speed Tests: Testing Google, Yahoo and MSN

I have to run to a meeting, so I won't be able to update this site for a few hours. I found this post at Cre8asite that seems interesting; it is named A small speed test for the three major SE's.

Speak to you all later.

posted rustybrick in Search Technology at February 7, 2005 9:12 AM Comments (0)

More On Latent Semantic Indexing

Yesterday I wrote an entry named Latent Semantic Analysis (LSA) - Crawl into the Google Algorithm?, where I discussed how the current theories behind the Google SERP changes have to do with a new algorithm shift for Google. Now many believe that this has a lot to do with Latent Semantic Indexing. So now, as an SEO, if you haven't already, it's time to read up on all the papers on this topic. I, Brian posted a new thread with resources to papers on the topic; he thanks SEW Moderator Marcia for the help with the papers. I'll list links to those papers below. Then Ammon Johns posts a quote from one source that really does a great job summarizing the topic. In addition, he points to a thread on this topic started in 2002 at Cre8asite Forums named The Semantic Web.

Here is the snippet Ammon quoted in the SEW thread:

Regular keyword searches approach a document collection with a kind of accountant mentality: a document contains a given word or it doesn't, with no middle ground. We create a result set by looking through each document in turn for certain keywords and phrases, tossing aside any documents that don't contain them, and ordering the rest based on some ranking system. Each document stands alone in judgement before the search algorithm - there is no interdependence of any kind between documents, which are evaluated solely on their contents.


Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.


When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don't contain the keyword at all.


[ Source: http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm]
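For readers who like to see it in code, here is a tiny Python sketch of the idea in the snippet above: a query for "car" can match a document that only says "automobile", because both words share a latent dimension through "engine". The three toy documents are made up for illustration. To stay dependency-free, the sketch swaps the usual SVD step for power iteration on the term co-occurrence matrix (which yields the same leading term-space directions), so treat it as an illustration under those assumptions, not as how any search engine actually implements LSI.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def top_eigenvectors(sym, k, iters=500):
    """Leading k eigenvectors of a symmetric matrix (power iteration + deflation)."""
    m = [row[:] for row in sym]
    vecs = []
    for _ in range(k):
        v = normalize([1.0] * len(m))
        for _ in range(iters):
            v = normalize([dot(row, v) for row in m])
        lam = dot(v, [dot(row, v) for row in m])
        vecs.append(v)
        # Deflate: remove the direction just found before looking for the next one.
        for i in range(len(m)):
            for j in range(len(m)):
                m[i][j] -= lam * v[i] * v[j]
    return vecs

# Toy collection (assumed data).
TERMS = ["car", "automobile", "engine", "apple", "fruit"]
DOCS = [["car", "engine"], ["automobile", "engine"], ["apple", "fruit"]]

def term_vector(words):
    return [float(words.count(t)) for t in TERMS]

doc_vectors = [term_vector(d) for d in DOCS]
# Term-by-term co-occurrence counts across the whole collection.
cooc = [[sum(d[i] * d[j] for d in doc_vectors) for j in range(len(TERMS))]
        for i in range(len(TERMS))]
AXES = top_eigenvectors(cooc, k=2)  # keep two latent dimensions

def to_latent(vec):
    return [dot(axis, vec) for axis in AXES]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def lsi_similarities(query_words):
    q = to_latent(term_vector(query_words))
    return [cosine(q, to_latent(d)) for d in doc_vectors]

sims = lsi_similarities(["car"])
```

With this toy data, the "automobile engine" document scores as high as the "car engine" document for the query "car", while the "apple fruit" document scores about zero, even though a plain keyword match would have missed the second document entirely.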

Here is a listing of papers on the LSA topic from the thread:

Added: Check out SEO Book's LSI post, very detailed and easy to read. Good work.

posted rustybrick in Search Technology at February 4, 2005 8:57 AM Comments (2)

What is Relevancy in Terms of Search

Determining what is relevant can be an incredible task. It is the goal of all search engines to figure out which Web pages are the most relevant for a particular searcher based on the one, two, three or four words they type into the query box. It is not easy because of the subjectivity involved in the query process.

A thread at Search Engine Watch Forums named What is Relevancy discusses just that. Orion, the resident search technologist, gives the textbook scientific definition, "A judgment which relies heavily on semantics." But then he goes deeper into how someone in the information retrieval field would determine such a thing. He says you set a hypothesis, such as "If a document and a query have a (key)word in common, the document is likely to be relevant." And then you try to disprove it. He adds, "Some IR systems precisely are designed for the sole purpose of disproving this hypothesis."

Mel then goes into the subjectivity of such a question, asking: "Relevancy to the words searched for in a search engine? Relevancy to the topic searched for in a search engine? Relevancy regarding links? Relevancy for the searcher?" and so on. Others chime in on the thread, such as ProjectPHP. And then Danny Sullivan gives a great recap at this post.

posted rustybrick in Search Technology at January 5, 2005 8:30 AM Comments (0)

URL Normalization: Is a Trailing Slash the Same Page

There is a very interesting thread brewing at Search Engine Watch Forums named Is A Trailing / On A Directory Seen As A Differnet File By Google? In this thread a member gives an example of the same page, reachable at two URLs that differ only by the trailing slash, showing different PageRank values. His example is:

http://www.avismauritius.com/en/locations/ PR=3
http://www.avismauritius.com/en/locations PR=0

In the thread, Orion, the resident search technology guru at SEW forums, discusses how search engines normalize the URLs in order to give each URL a unique identifier. I hope that I explain this correctly. It is my understanding that the unique identifier is a hash string, possibly a 64 or 128 bit hash string. In order to assign a unique identifier, the URL needs to be stripped down and normalized. The process is a bit like Orion stated:

- Removal of the protocol prefix (http://) if present
- Removal of a :80 port number specification if present (However, non-standard port number specifications are retained)
- Conversion of the server name to lower case
- Removal of all trailing slashes ("/")
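The steps Orion lists can be sketched in a few lines of Python. This is only my reading of those steps; the function names are made up, and the choice of SHA-1 truncated to 64 bits is purely an assumption to illustrate the "unique identifier" idea, not a claim about what any engine really uses.

```python
import hashlib

def normalize_url(url: str) -> str:
    """Apply the normalization steps listed above, in order."""
    # Remove the protocol prefix if present.
    if url.lower().startswith("http://"):
        url = url[len("http://"):]
    # Split the server name (plus optional port) from the rest of the URL.
    host, sep, path = url.partition("/")
    # Lower-case the server name only; the path stays case-sensitive.
    host = host.lower()
    # Drop the default :80 port; non-standard ports such as :8080 are kept.
    if host.endswith(":80"):
        host = host[:-3]
    # Remove all trailing slashes.
    return (host + sep + path).rstrip("/")

def url_id(url: str) -> str:
    """Hypothetical 64-bit identifier: first 8 bytes of a SHA-1 digest."""
    digest = hashlib.sha1(normalize_url(url).encode("utf-8")).digest()
    return digest[:8].hex()

with_slash = url_id("http://www.avismauritius.com/en/locations/")
without_slash = url_id("http://www.avismauritius.com/en/locations")
```

Under these steps both example URLs collapse to the same identifier, so an engine that normalized this way would not show two different PageRank values for them. That is part of why the question of what Google actually does remains open.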

However, this does not really explain if Google does all or some or none of this. Moderator Chris_D referenced an old WebmasterWorld thread where GoogleGuy sheds some more light on this topic. He talks a lot about http responses and URL requests, but the important line to get out of the thread is "I would always recommend the trailing slash. If you know the exact right url, it's often best to give it directly and save everyone that extra redirect." You also might want to check out msg # 6 in that thread.

PageOneResults from the SEO Consultants Directory explains that this is more a matter of "content negotiation". He goes on to explain:

The W3C and other large website structures are now utilizing content negotiation. That means that this...

www.example.com/sub

...could be different than this...

www.example.com/sub/

With the use of content negotiation, there are no file extensions. Basically you are cleaning the URI of all underlying identifying technologies.

Bottom line, the same URL with and without a trailing slash can be, and is, considered different by most search engines. Most duplicates are weeded out through duplicate content filters, and most sites do not have this problem because of the built-in way the server handles these URL requests.

posted rustybrick in Search Technology at December 28, 2004 3:00 PM Comments (0)

PubSub LinkRanks Deploys Form of Temporal Link Analysis

Back in October we highlighted a thread started by Orion on the topic of Weighing the Time of a Link: Temporal Link Analysis, which discusses various algorithmic formulas that can be applied to link data in order to improve search results. The thread now gets more practical, as Bob Wyman, CEO of PubSub.com, shows how his engine ranks pages. As opposed to most link algorithms, PubSub.com, DayPop.com and PopDex.com all use a form of time based link analysis to define which pages are most popular NOW.

On the PubSub LinkRanks explanation page it describes how it all works. Basically, it first collects and maps the linking data from weblogs in its index, then it assigns values for each link pointing to a page, then computes "Link Scores for Each Domain", and finally applies the time factor to the link data by weighing "the trailing ten days' link scores by factors of 2".

Bob joins the Search Engine Watch thread named Temporal Link Analysis to add his practical experience with assigning time metrics to linkage data. Orion asks some excellent questions, so join in.
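One possible reading of weighing "the trailing ten days' link scores by factors of 2" is that each day further back counts half as much as the day after it. The Python sketch below assumes exactly that; the function name and the input layout are mine, and PubSub's real formula may well differ.

```python
def time_weighted_link_score(daily_scores):
    """Combine per-day link scores, halving the weight for each day of age.

    daily_scores[0] is today's link score, daily_scores[1] is yesterday's,
    and so on. Only the trailing ten days are considered (assumed reading).
    """
    window = daily_scores[:10]
    return sum(score / (2 ** age) for age, score in enumerate(window))

recent_burst = time_weighted_link_score([8, 4, 2])
stale_links = time_weighted_link_score([0] * 10 + [100])
```

The point of a scheme like this is that a page linked heavily eleven days ago contributes nothing, while a modest number of links today still counts in full.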

PubSub asks that I include this link that goes nowhere, http://psi.pubsub.com/20040413:linkranks:1 "to see if we can construct a conversation thread around the topic by using a common URN."

posted rustybrick in Search Technology at November 22, 2004 10:11 AM Comments (0)

Hilltop and Theme Based Optimization

Ever since the beginning of 2004, after the Florida update in November 2003, there has been discussion of the hilltop paper and "theming" a site's pages. In fact, read some of the past entries on the topic written here named:
- Dan Thies Writes on the New Google
- Webby's Back - Topic Austin Theory
- Authority Can't Do it Alone - Bring Out the Hub
- True Meaning of Themed Sites & The Level of Importance in the Ranking Algorithms

A topic over at WebmasterWorld named Anyone besides me not swallowed the "Hilltop" magic pill yet?, started by moderator BakedJake, discusses this topic. It is true, in my opinion, that page ranking at Google has nothing to do with theming the pages in your site. I have spoken with experts who tell me this is the case and even Google Reps hinted at this.

The question is, "When you play chess you try to anticipate your opponent's most likely next move(s), and make your move accordingly. So planning today for hilltop tomorrow, isn't a bad idea." That is for a different thread. :)

posted rustybrick in Search Technology at November 4, 2004 9:23 AM Comments (0)

Weighing the Time of a Link: Temporal Link Analysis

Dr. Garcia (aka Orion) over at Search Engine Watch forums created a thread named Temporal Link Analysis, which discusses a paper presented by IBM researcher Einat Amitay. This paper discusses the difference between journal citations and Web IR citations. The basic premise is that the more frequent AND the more recent those citations are, the more important the journal is.

If we apply the concept of temporal link analysis to the much manipulated linkage structure of the current Web index, then we can possibly provide more relevant pages to the search user. To achieve this, the search engines would have to accurately capture time data in both the last update of the document as well as the last update of the link found within the document. Capturing this data is no easy task, nor is processing this information. But if it can be done, then the search engines can assign higher weights to more recent links as opposed to older links.

Why would you want newer links to be worth more? There are (generalizing here) two types of Web pages: (1) a page that is updated on a constant basis (be it daily, weekly, monthly) and (2) a page that is put up once and left there forever. Page number two, the one that is written once and left alone forever, is by its very nature outdated and irrelevant. The information, the links and the citations from this page launched in 1996 are most probably gone, misplaced or outdated. However, if a page that was launched in 1996 but is updated on a monthly basis contains links (i.e. citations), it can be assumed that this page is a "timely authority" and thus can be assigned a higher weight.

I was in the process of developing a small Excel worksheet with the appropriate weights I feel should be associated with pages based on this paper, but I stopped. Instead let me just give you my thoughts in words, and you can argue. :) Pages that are old but are never updated should be assigned a very low weight (possibly close to 0). Pages that are new and not updated (within a year or a few months) should be given a higher weight than the old page that is not updated; however, such a page is still not an authority, so it should get a relatively low weight (possibly close to .1). Now, pages that are new but updated often, relative to the date the page was first created (or found), should be given a higher weight than the above (possibly close to .2). And pages that are old and updated on a frequent basis, relative to the date the page was first created, should be given the highest link weight (possibly close to .4).
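Since I never finished the worksheet, here is the same thinking as a small Python sketch. The four weights are the ones from the paragraph above; the cutoffs for what counts as "old" (over a year) and "updated" (changed within the last 90 days) are assumptions I picked just so the example runs, and the function name is made up.

```python
from datetime import date

def link_weight(created, last_updated, today,
                old_after_days=365, fresh_within_days=90):
    """Weight for links found on a page, based on its age and upkeep.

    old_after_days and fresh_within_days are assumed thresholds.
    """
    is_old = (today - created).days > old_after_days
    # "Maintained" means the page changed after creation, and recently.
    is_maintained = (
        last_updated > created
        and (today - last_updated).days <= fresh_within_days
    )
    if is_old and is_maintained:
        return 0.4   # old and still updated: the "timely authority"
    if is_maintained:
        return 0.2   # new and updated often
    if not is_old:
        return 0.1   # new but not updated
    return 0.0       # old and never updated

today = date(2004, 10, 25)
abandoned = link_weight(date(1996, 1, 1), date(1996, 1, 1), today)
authority = link_weight(date(1996, 1, 1), date(2004, 10, 1), today)
```

Feel free to argue with the thresholds as well as the weights. :)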

Keep in mind, there are of course other factors. The paper discusses DIPs (Dated Inlink Profiles), where you associate a profile with a topic or community. You look at the dated inlinks of a page, associated with a topic/concept/community, to determine its value. Ok, I am stopping there. :)

posted rustybrick in Search Technology at October 25, 2004 9:43 AM Comments (0)

Orion, Dr. E. Garcia, SEW Forum Posts

I normally do not pick out individual members at forums, but one such member deserves a special mention. This member, now an SEW Forum Moderator, goes by the user name Orion, but is formally known as Dr. E. Garcia. SEW Forums, in my opinion, is lucky to have such a unique individual participating daily in the forum.

Let's just take a look at a selected number of Dr. Garcia's threads:

All of these posts have shaped my understanding of search technology. This one individual deserves our thanks.

Thank you Dr. Garcia, a.k.a. Orion.

posted rustybrick in Search Technology at October 21, 2004 7:40 PM Comments (0)

Block Analysis with Image Retrieval

What are some of the methods to combat the issues with linkage data, that search engines often depend upon for ranking one page over the next? If you're reading this blog, you most probably know all too well about how sites exchange links with each other, how people buy text links and all the spam found in blogs and other open sites. So how does a search engine weed out the 'meaningless' or 'less important' links from a Web page?

One answer is something called "block analysis" and there is an excellent thread going on at SEW Forums named Block Analysis 101. I will warn you, it gets a bit technical. Let me pull one concept out of the thread, and hopefully come back to this thread at a later time to discuss the rest.

How does a search engine look at the "blocks", "passages" or location of the content on the page as would a human? With the use of CSS, it can be very hard for an engine to understand which content goes with which links. The goal is for the engines to look at a page, understand the blocks within the page and then assign appropriate weights to the content and links based on which 'block' the content and links are found.

For example, take a look at the image below of a typical content site. You will see how I separated out the major components of a page's layout. Removing all the fluff, when a human finds the page he or she is looking for, they want to simply focus on the middle portion of the page, the "content area". And one would expect that the links and content within the 'content area' are the most relevant to what this page is discussing. If a link is found within that section, it is sometimes good to assume that the link is important; in fact, it is probably one of the most important links on the whole page. Search engines know that. (I say "sometimes" because we are now finding contextual ads within the content of pages, where words in a passage are dynamically changed, based on a keyword match, to link to an advertiser's site.)

block-analysis-with-image-a.gif

Some search engines are experimenting with a form of image retrieval, where the engine captures an image of the page, breaks the image out into blocks of passages (as would a human) and then assigns the appropriate weights to the various blocks of content. So now the "text ads" on the bottom left will be worth a lot less than the links found within the left nav, and even less than the links found within the content area.
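Once the page has been split into blocks, the weighting part is simple. Here is a minimal Python sketch that assumes the hard part (the segmentation) has already been done and every link arrives tagged with the block it was found in. The block names and the numeric weights are invented for illustration; only their order (content area above left nav above text ads) comes from the paragraph above.

```python
# Assumed weights: only the ordering reflects the discussion above.
BLOCK_WEIGHTS = {"content": 1.0, "left_nav": 0.5, "text_ads": 0.1}

def weigh_links(links, default_weight=0.25):
    """Map each URL to a weight based on the block where it appears.

    links: iterable of (url, block_name) pairs from a segmented page.
    A URL that appears in several blocks keeps its highest weight.
    """
    weights = {}
    for url, block in links:
        weight = BLOCK_WEIGHTS.get(block, default_weight)
        weights[url] = max(weights.get(url, 0.0), weight)
    return weights

page_links = [
    ("http://www.example.com/related-article", "content"),
    ("http://www.example.com/about", "left_nav"),
    ("http://ads.example.net/buy-now", "text_ads"),
    ("http://www.example.com/about", "text_ads"),
]
link_weights = weigh_links(page_links)
```

A link that only ever shows up in the text ads block ends up with the lowest weight, which is the whole point of the exercise.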

posted rustybrick in Search Technology at October 14, 2004 9:24 AM Comments (0)

Online Discovery of Secondary Terms Associated to a Theme Experiment

Orion, a member at SEW forums, has decided to allow some of the SEW Forum members to participate in an experiment named "Online Discovery of Secondary Terms Associated to a Theme Experiment". The thread named Call for five SEOs is where Orion explains the study and asks for 5 volunteer SEOs to submit a keyword phrase that meets the experiment's guidelines. I am pretty sure all the slots are filled, but maybe he can squeeze in a few.

Orion explains that this experiment aims to discover secondary keywords associated with a theme represented by an initial key phrase.

Two keywords need to be submitted that have a c-index of 25 points or higher. If you're interested, check out the thread here.

posted rustybrick in Search Technology at September 5, 2004 10:45 AM Comments (0)

Bombing the Search Engines: The Real Search Wars

An excellent post by Orion on the topic he named Who bombs Whom?, which he admits he should have named "The Good, the Bad and the Ugly Queries". Here he runs some queries on keyword phrases at the various search engines.

A search for "bad search engine" without quotes shows that
1. In Google, Yahoo.com is #2 out of 4,650,000 results
2. In Yahoo!, MSN.com is #1 out of 6,330,000 results
3. In MSN, MSN.com is #1 out of 1,323,674 results.
4. In Teoma, GO.com is #2 and Search.com is #3 out of 3,178,000 results.

A search for "good search engine" without quotes shows that
1. In Google, DogPile.com (a metaengine) is #1 out of 8,520,000 results.
2. In Yahoo!, DogPile.com (a metaengine) is #1 out of 15,000,000 results
3. In MSN, DogPile.com (a metaengine) is #1 out of 3,112,205 results.
4. In Teoma, Yahoo.com is #1 and Teoma is #3 out of 9,448,000 results.

Check out the thread.

posted rustybrick in Search Technology at August 9, 2004 8:37 AM Comments (0)

Block (Passage) Level Link Analysis by MSN

With all this discussion about the problems with PageRank and HITS, Microsoft recently released a paper discussing its solution for the faults in PageRank and HITS. The basic premise of the article, which can be downloaded here, is that the fault lies in treating all links on a single page as equal when they are not. By breaking up the page into "blocks" or "passages" (as Orion likes to call them in the thread at Search Engine Watch), you can semantically understand which sections of the page are about what. And then, based on the mathematical location of links, determine the weight and relevancy of each link.

Very interesting idea, of course this can be abused as well. I for one would love to see this working at MSN Search. For discussion, please join the Search Engine Watch thread. Here is a passage:

Link Analysis has shown great potential in improving the performance of web search. PageRank and HITS are two of the most popular algorithms. Most of the existing link analysis algorithms treat a web page as a single node in the web graph. However, in most cases, a web page contains multiple semantics and hence the web page might not be considered as the atomic node. In this paper, the web page is partitioned into blocks using the vision-based page segmentation algorithm. By extracting the page-to-block, block-to-page relationships from link structure and page layout analysis, we can construct a semantic graph over the WWW such that each node exactly represents a single semantic topic. This graph can better describe the semantic structure of the web. Based on block-level link analysis, we proposed two new algorithms, Block Level PageRank and Block Level HITS, whose performances we study extensively using web data.
block-links.jpg

posted rustybrick in Microsoft MSN Search at July 30, 2004 8:34 AM Comments (0)

Search by Area Code at Google, Yahoo But Not Ask

Before calling a prospect that I have never spoken with before, I look up where this person is located geographically. Normally, the leads I get are from the United States, so I look up the area code of the phone number to find out what time zone they are in. I do this every now and then, but this time I tried to go with the invisible tabs approach.

I knew Yahoo! improved this just recently, so I did a search on the area code 507 at Yahoo!.

yahoo-area-code.gif

Now that was easy, I gave the prospect a call. Unfortunately, they are closed on Mondays. :)

So then I travelled over to Google to see what they offer in the way of area code lookup. Keep in mind, I am just searching on 507 with no special prefix or special tab.

google-area-code.gif

Not as pretty, but Google also gave me the information I was looking for without jumping through hoops.

Ok, now for my buddy Ask Jeeves. I thought it was a no brainer. Ask Jeeves is the leader in this type of stuff. Conducting a search at Ask on 507 did not, I repeat, did not, give me the information I wanted right on the search page. What happened to Ask Jeeves Search Smarter motto?

posted rustybrick in Search Technology at July 26, 2004 10:41 AM Comments (0)

After Clicking on a Result

A post over at HighRanking's Forum got me thinking. First, let me clarify that once you click on a result in Google or another search engine, the engine knows nothing more.

Some people think that since you clicked from Google to page A, Google knows how long you spent there and whether you made a purchase. Actually, my own dad was under the impression that Google knew if someone placed an order on a site if they came from the Google search engine. Sorry dad, if you're reading this, but this really blew my mind.

In general, once you leave a search engine, they no longer know anything about your travels or actions. Now that that is out of the way, let's move on.

So the user gets to your site, you know where they came from and what they are doing on your site. Right? I'll assume you know this information; if not, then get yourself a decent Web analytics package. It is really a shame how ill-informed many of the smaller e-commerce site owners are about their site's traffic and conversion rates. If they tracked their traffic correctly, then making informed decisions on site changes would become less guesswork and more metrics-based. I keep coming back to this, so I am sorry. Feel free to read my past entries on analytics.

Oh, and one more note, Scottie is right on.

posted rustybrick in Search Technology at July 1, 2004 12:17 PM Comments (0)

vCard Supported by Search Engines - Well Maybe

Looks like both Yahoo and Google can read vCards (defined by Microsoft as "The Internet standard for creating and sharing virtual business cards.")

Conduct a search in the format of allinurl:vcard.asp at both Yahoo! and Google and you will find vCards that come up.


Google now has a "view as HTML" version of the vCard. For example, here is someone's HTML version of the vCard format. Please do not call them. :) So Google is now reading these vCard formats. Here is one example where the exclusion protocol in the robots.txt file comes in handy.
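If you want to keep your own vCard script out of the engines, a robots.txt rule along these lines should do it. The Python sketch below uses the standard library's robots.txt parser to check such a rule; the example.com host and the /vcard.asp path are placeholders, so swap in whatever your site actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers away from the vCard script.
ROBOTS_TXT = """\
User-agent: *
Disallow: /vcard.asp
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks the rule before fetching a URL.
vcard_allowed = parser.can_fetch("Googlebot", "http://www.example.com/vcard.asp")
home_allowed = parser.can_fetch("Googlebot", "http://www.example.com/index.html")
```

With that rule in place the vCard URL is off limits while the rest of the site stays crawlable. Keep in mind that robots.txt only stops crawlers that choose to honor it.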

Forum coverage at WebmasterWorld.

posted rustybrick in Search Technology at June 29, 2004 9:28 AM Comments (0)

SEM as a Business Decision

Can you have it both ways? Can a site that ranks well in the search engines also have a high conversion rate? How many first placed results have you seen that have the worst user interface? As an SEO or SEM consultant, is it your responsibility to provide both high rankings and high conversions? There is no doubt that a site needs both, high visibility and high conversion rates but as an SEO or SEM, is it your responsibility?

I say yes. A business decision is almost always about ROI. If you drive traffic to the site and your conversion rates are not high enough to make for a positive ROI, then the decision was bad. If you have an incredibly easy to use site and high conversion rates but no one sees your site, then it is a bad business decision.

If someone contracts your company and you only focus on rankings, then you should have a partner that focuses on usability. This goes both ways and it doesn't end with rankings and conversion rates. What about after the order? You need a good back-end management system to manage the thousands of orders your site is processing. Your customers need to be kept in the loop; they will account for most of your business (repeat business). Good back-end tools and customer service systems enable this.

Managing your orders, customers and products efficiently, achieving high ranking and having high conversion rates leads to a good business decision.

Inspired by a post at highrankings forum.

posted rustybrick in Search Technology at May 10, 2004 4:57 PM Comments (0)

Comfort in Consistency But Problems in Predictability

This morning I went to my favorite bagel store to pick up breakfast. The guy who serves me saw me walk in the shop and yelled to the chef in the back, "two eggs on a bagel". I thought to myself, well that was nice - he knows what I wanted and ordered it for me. But then I thought, am I that predictable? Maybe next time I should order something different. Being so predictable bothered me.

Let me relate this to search engines...

Search engines on one hand try to provide a level of consistency to their results in order to comfort the searcher. So if I search for "search engine roundtable" in Google, I want to find this site listed number one every time and then find results that are relevant to the search engine roundtable under it. Relevancy and accuracy provide that level of comfort needed.

But on the other hand, search engines need to remove any level of predictability from their search engine results. By that I mean, search engines can not and should not provide a schematic on how to ensure one's site is listed number 1, 2, or 3 for a selected keyword. If search engines were so predictable that a person could guarantee a #1 spot for a specific keyword, then it would be a joke. Same with the bagel story, if I walk in there and order the same thing each time, don't you think it might get a bit funny? It would be a problem if search engines were so predictable.

Consistency is comforting, predictability is problematic. My philosophical post for the month. :)

posted rustybrick in Search Technology at April 20, 2004 12:15 PM Comments (0)
