Search Technology Archives

Did Google Trends Predict the Super Tuesday Results?

Yesterday was Super Tuesday in the United States, when many of the presidential primaries were held. SEO.com has written up a post that uses Google Trends data for each state to try to predict the winners. Judging by Google Trends, it seems plausible that people's search interest tracked the eventual winners.

Of course, not everything is perfect using Google Trends, as pointed out by DigitalPoint Forums members. However, it's interesting to see how many results correlated with the trending data.

I can imagine that other competitive intelligence firms will come out with results later today or tomorrow that will measure the results even more accurately.

Or will they?

Forum discussion continues at DigitalPoint Forums.

posted Tamar Weinberg in Search Technology at February 6, 2008 9:49 AM Comments (2)

Who Is Better At Finding Duplicate Content?

So who likes finding duplicate content more? Or who has the most to gain by finding duplicate content - Google or Copyscape? The forum members on DigitalPoint are having an interesting conversation on the differences in the ways Google and Copyscape find duplicate content. Most agree they use different algorithms to find duplicate content, but how hard can it really be? Remarkably, the consensus is that most people see each company as unique in the way it finds duplicate content.

WebmasterWorld is also having some discussion on duplicate content. Namely, how do you deal with heavily copied content? Some of the members have excellent advice on dealing with duplicate content these days.

1. Don't Go After the Content Scrapers


I rarely go directly after the copiers these days - instead I focus on strengthening the website itself. It's harder for copied content to beat a strong website - but from time to time it happens.

2. Go After the Web Hosting Company Instead

First, file DMCA notices against all US-based webhosts or server companies involved. Personally, I skip informing the webmaster first as he might not be US-based and it seldom works. Hosts and server companies normally take down whole sites or even servers (not just individual infringing pages) meaning potentially crippling losses for the webmaster renting a dedicated server from which he runs multiple scrapers.

3. Strengthen Your Website

The older a site is and the more pages and more trust it gains (along with measures to help deter scraping like using full urls, etc.) the less likely that having scraped content will cause any harm.

4. Don't Abandon The Original Content

Second, if you do re-write, don't abandon your original content. There's obviously a market for it, so arrange for it to be used on other sites, by agreement and with appropriate links.

5. Insert Your Website or Company Name Into The Content

I try and generally include a link or two in my content to another content page in my site, since most people copying do so by way of automation and pick up your link.

6. Consider That Eventually Someone Will Steal Your Content

i've personally given up on any rights on anything in any matter on the internet - I don't publish anything or put anything on the internet which I don't want to be re-distributed on a massive scale, be edited, laughed at, cried about, never quoted

Continued discussion at DigitalPoint and WebmasterWorld.

posted Phoenix in Search Technology at October 4, 2007 12:38 PM Comments (0)

Google Campaign Optimizer – A Friend or Foe?

Google AdWords and other paid search marketing (PPC or pay-per-click) platforms including Yahoo! Search Marketing and MSN AdCenter provide very robust administration platforms to allow for a variety of options to be specified when serving ads to searchers and contextual partner sites. Just over the past two years, these systems have become increasingly capable of allowing advertisers to focus their budgets towards keyword phrases that are providing return on their advertising spend. Even with all these capabilities, a lack of testing, research and/or understanding of past performance can make campaigns inefficient.

Google offers one particular option to its AdWords advertisers called “campaign optimizer.” In the past, this was an option that could be found in the campaign level settings page; however, Google seems to be pushing its “enhancements” more often within the main dashboard. A recent thread started at WebmasterWorld forums describes this situation, as a user relates seeing messages within his dashboard suggesting he run the campaign optimizer in order to “increase traffic by 17%.”

Is this the right tactic to use? Should advertisers allow Google to enhance their campaigns’ performance? In my opinion it depends on the account. If the owner of the account has the time to run tests and optimize the campaign themselves, they will likely do at least as well as Google, if not better, without having to fear that in Google’s eyes “optimization = spending the entire daily budget each day.”

Join the discussion at WebmasterWorld forums.

posted chrisboggs in Search Technology at October 4, 2007 8:33 AM Comments (0)

Is PageRank Juice the Only Value of a Link?

A Cre8asiteforums discussion called The Divide Between Search Engines And Seo's - "No Follow" Fiasco points to two somewhat emotional discussions elsewhere on the possible ramifications or practices of using the "rel=nofollow" tag in links.

In one case, the US Federal Trade Commission enters the arena to stir up the pot for paid links.

In another case, a blog directory has been accused of not passing "link juice" to the blogs that have submitted to it and of using JavaScript "onclick" code in its URLs.

The thread at Cre8asiteforums points to both discussions and members returned to voice their opinions. Both the Sphinn and SEOFastStart discussions provide a chance to learn more, regardless of who is right or wrong. I was voted completely out into the universe in Sphinn for remarking that PR can't possibly be the "only" reason people submit to blog directories.

Apparently, I'm terribly wrong about that.

posted cre8pc in Search Technology at September 13, 2007 12:07 PM Comments (3)

Getting Users to Bookmark Your Site: Traditional Bookmarking vs. Social Bookmarking

Traditional bookmarking seems obsolete. Adding a bookmark to your browser is, to many, a practice that has since seemingly been replaced by newer methods -- social bookmarking sites, if you will.

However, not everyone is aware of these social bookmarking sites, nor are they ready to abandon their traditional methods of bookmarking. A Cre8asite Forums thread touches upon this subject. In the thread, administrator EGOL suggests that traditional means of bookmarking stay intact, and social bookmarking methods through sites like AddThis.com not necessarily be implemented -- or at least done as a secondary option.

This is exactly what other members agree is the right thing to do:

Absolutely. For most visitors bookmark this site means triggering a bookmark in their browser. I would add social bookmarking micro-icons for the rest. It also comes out as more honest.

The other angle is whether to have social bookmarking sites only, browser bookmarking only, both, or simply some text saying "hey, press Control-D to bookmark this!". IMHO, you need at the very least the Ctrl-D text and some of the social bookmarking sites.

For those not ready to jump into the social bookmarking realm, you should make sure that if you include a bookmarking option, your website accommodates these types of users.

Discussion continues at Cre8asite Forums.

This article was written this past Monday and scheduled for publication on Wednesday, May 23rd.

posted Tamar Weinberg in Web Promotion at May 23, 2007 9:27 AM Comments (1)

75% of Google's Blogspot Blogs are Spam

On a recurring theme of Internet spam, a study discussed in WebmasterWorld indicates that three out of four Blogspot blogs -- or 75% -- that appeared in its search results are spam.

According to the study (PDF link):

...14 of the top-15 doorway domains have a spam percentage higher than 74%; that is, 3 out of 4 unique URLs on these domains (that appeared in our search results) were detected as spam. To demonstrate the need for scrutinizing these sites, we scanned the top-1000 results from two queries – “site:blogspot.com phentermine” and “site:hometown.aol.com ringtone” – and identified more than half of the URLs as spam easily.

Here is a chart from the study showing the "top doorway domains and their spam percentages (among the search results in our data)":
[Chart: top doorway domains and their spam percentages]

The suspected reason for this is that the popular blogging service is free. One WebmasterWorld member states:

The trouble is, there's no algorithm that can automatically factor in the price of a service. It's free to set up a blog on Blogger, so it can be abused more easily. If these spammers actually had to pay for a new domain name every time they set up a splog, they wouldn't bother.

Other findings of this research showed the spam percentages for Top-Level Domains (TLDs):

  • 68% of .info TLDs are spam
  • 53% of .biz TLDs are spam
  • 12% of .net TLDs are spam
  • 11% of .org TLDs are spam
  • 4.1% of .com TLDs are spam


Forum discussion continues at WebmasterWorld.

posted Tamar Weinberg in Spam at March 20, 2007 9:43 AM Comments (8)

Yahoo! & Microsoft Release Papers on Web Spam

A WebmasterWorld thread links to a December 2006 paper at Yahoo! Research named A Reference Collection for Web Spam. The paper can be downloaded as a PDF file. It is not brand new, but it is relatively recent. Here is the abstract:

We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if the hosts are include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labeled by a large and diverse set of judges.

Gary Price of ResourceShelf linked to an updated paper from Microsoft on Web spam. The 10 page PDF file is named "Spam Double-Funnel: Connecting Web Spammers with Advertisers." Here is the abstract:

Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords – one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

So here is your weekend reading.

Forum discussion at WebmasterWorld.

posted rustybrick in Spam at March 16, 2007 7:41 AM Comments (0)

How Do Search Engine Robots Work?

I have always had a thing for spiders. Not the creepy crawly kind, but the kind made of bits and bytes that scour the web for new documents to download and index. They are so predictable, but at the same time they can be quite surprising when you least expect it. How the hell did they do that or find that page? Many a webmaster has scratched their head in disbelief at a crawler at one time or another. There is a thread on WebmasterWorld asking new questions about how search engine crawling technology works and the bare bones infrastructure of how a search engine goes from finding a page to ultimately deciding to list it in its search results. It covers the nuts and bolts of the technology and also updates previously known information with new questions and answers.

So how do search engine robots work and what comprises them?


Spider : a robotic browser like program that downloads webpages.
Crawler : a wandering spider that automatically follows links found on pages.
Indexer : a blender like program that dissects webpages that are downloaded by spiders.
The Database : a warehouse of the pages downloaded and processed.
Search Engine Results Engine : digs search results out of the database
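To make those five pieces a bit more concrete, here is a minimal Python sketch of how they might fit together. It is only an illustration, not how any real search engine works: the `FAKE_WEB` dictionary, the example.test URLs and every function name are made up for the example, and the "spider" reads pages from memory rather than over HTTP.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real spider would fetch over HTTP.
FAKE_WEB = {
    "http://example.test/": '<a href="http://example.test/a">A</a> welcome home',
    "http://example.test/a": '<a href="http://example.test/b">B</a> spiders crawl pages',
    "http://example.test/b": "indexers dissect pages",
}


class LinkAndTextParser(HTMLParser):
    """Collects link targets and lowercase words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def spider(url):
    """Spider: download one page (here, look it up in FAKE_WEB)."""
    return FAKE_WEB.get(url)


def crawl(seed):
    """Crawler plus indexer: follow links breadth-first and index each page."""
    database = {}             # the database: URL -> raw page
    index = defaultdict(set)  # inverted index: word -> set of URLs
    queue, seen = deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        html = spider(url)
        if html is None:
            continue
        database[url] = html
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:   # indexer: dissect the page into words
            index[word].add(url)
        for link in parser.links:   # crawler: queue links not seen before
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database, index


def search(index, word):
    """Results engine: dig matching URLs out of the index."""
    return sorted(index.get(word.lower(), set()))
```

Calling `crawl("http://example.test/")` and then `search(index, "pages")` would return the two pages that contain that word. Ranking, politeness delays and robots.txt handling are all left out on purpose.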

Pageoneresults takes it a step further in creating this thread to ask new questions about search engine robots for those that are not previously familiar.


1. Do robots accept cookies?
2. What happens if my site forces a cookie?
3. Do robots execute JavaScript functions?
4. Could I be doing something technically that is stopping a robot from indexing my site?
5. How do robots interpret my page?
6. In what order do robots index my page? What is the very first step that a robot takes?

Continued discussion on WebmasterWorld - How Do Robots Work?

posted Phoenix in Search Technology at January 10, 2007 9:02 AM Comments (0)

SEO is Magical Fairy Dust and That's Final

As juicy topics go, there's always "SEO is Dead", "SEO isn't Rocket Science" (or is), and "classic" SEO is about as useful as sugarless jello. Starting off the year with a bang, Mike Grehan jumps in with a ClickZ article called SEO: Art, Science, Bollocks Or What?

Taking a cue from more recent debates that brought out Danny Sullivan and Kevin Lee, Mike refers to a book called Web Dragons: Inside the Myths of Search Engine Technology, which he just digested. He asks us to take another look at our purpose for search in the first place and how we can apply new technology to better benefit business endeavors for the long haul.

In other words, marketing. Discussion is at Cre8asiteforums

posted cre8pc in Search Technology at January 8, 2007 10:16 PM Comments (2)

Bot Attacks: Yes It Can Happen To You

We all know about PPC fraud and that some of the fraud is caused by bots (robots) that click on the ads and drive up your bill and unwanted traffic. But it gets more serious than that. Bots are also used to steal your content and to hit your site with comment spam, guestbook spam, dhtml spam and some very bad hacks.

Often, when someone writes a script to have a bot do any of the evil things they may do, they let the bot run wild. Sometimes that may take down your server.

Discovery at Search Engine Watch Forums links to a Wired article named Attack of the Bots.

The latest threat to the Net: autonomous software programs that combine forces to perpetrate mayhem, fraud, and espionage on a global scale. How one company fought the new Internet mafia – and lost.

Bots have gotten to us, they have. They got to WebmasterWorld, DigitalPoint Forums, Search Engine Watch Forums and many, many other sites.

Discovery asks, not only in terms of PPC fraud and click fraud:

Have your concerns with bots grown over this past year?

I answer, Yes.

Forum discussion at Search Engine Watch Forums.

posted rustybrick in Spam at November 8, 2006 6:56 AM Comments (0)

Do You Care What The Search Data Says?

Or maybe I should change that to "Did you know that site search data has a story to tell?"

I found it interesting that there may be a lack of education on the part of SEO/M's, site designers, programmers and site owners (okay, all of us), on the value of website search. I'd come across a search analytics survey by Lou Rosenfeld and Rich Wiggins and decided to present it for discussion in What are the barriers to taking advantage of search analytics?

The reported verbatim answers to the survey on search analytics fascinated me, and so has the resulting conversation in the forum. Seems as though there's room for education on site search and its value, for those who offer site-wide searches. There are also tools that tell you what people are searching for that led to your site, but not much support on what you actually do with that information.

Every search phrase has a story. The survey supports the theory that there are "barriers" to "taking advantage of search analytics." Ammon Johns wrote:

"There were thousands of searches per day made, and trust me that the long tail was very visible."

It's clear, from the comments at Cre8asiteforums, that there are those who are fiddling with the data as best they can figure, and many more who don't know what to do with this pony.

posted cre8pc in Search Technology at September 27, 2006 7:01 PM Comments (0)

Search Ad Keyword Phrase Deletion Probabilities

Bill Slawski at Cre8asite Forums has a thread named Deletion Probabilities for Better Ads. In that thread he clearly explains a Yahoo! patent named System and methods for ranking the relative value of terms in a multi-term search query using deletion prediction. The abstract reads:

The likely relevance of each term of a search-engine query of two or more terms is determined by their deletion probability scores. If the deletion probability scores are significantly different, the deletion probability score can be used to return targeted ads related to the more relevant term or terms along with the search results. Deletion probability scores are determined by first gathering historical records of search queries of two or more terms in which a subsequent query was submitted by the same user after one or more of the terms had been deleted. The deletion probability score for a particular term of a search query is calculated as the ratio of the number of times that particular term was itself deleted prior to a subsequent search by the same user divided by the number of times there were subsequent search queries by the same user in which any term or terms including that given term was deleted by the same user prior to the subsequent search. Terms are not limited to individual alphabetic words.

Bill explains the logic of the patent as a method of deleting the less relevant word, if the whole phrase of the search query does not match an ad within the ad inventory.

This could be done by looking at two word searches from users, and seeing if they might delete one of the words in a follow-up search. Search engineers might be able to set something up to find such deletions, and create a "deletion probability score" for terms.
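Here is a rough Python sketch of how such a score might be computed from a query log. It follows one reading of the abstract: for each term, divide the number of follow-up searches in which that term was deleted by the number of follow-up searches in which any term was deleted from a query containing it. The `query_pairs` input format and the function name are assumptions for illustration, and terms are split on whitespace even though the patent says terms are not limited to individual words.

```python
from collections import Counter


def deletion_scores(query_pairs):
    """Return {term: deletion probability score} from (original, follow-up) pairs.

    Each pair is assumed to be two consecutive queries from the same user.
    Only follow-ups that are the original query minus one or more terms count.
    """
    deleted = Counter()  # times the term itself was dropped
    seen = Counter()     # times the term was in a query that had any term dropped
    for original, followup in query_pairs:
        before = set(original.lower().split())
        after = set(followup.lower().split())
        # Keep only pure deletions: a non-empty proper subset of the original.
        if not after or not after < before:
            continue
        for term in before:
            seen[term] += 1
            if term not in after:
                deleted[term] += 1
    return {term: deleted[term] / seen[term] for term in seen}
```

In this toy reading, a term with a high score is one that users tend to throw away when they refine a search, so the ad system would lean on the other term when picking ads.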

More details at Cre8asite Forums.

posted rustybrick in Search Technology at June 19, 2006 8:10 AM Comments (2)

SEOs, Don't Be Fooled by Personalized Search Results

There is a thread with a fun name at High Rankings Forum named False Gods. In the thread, a member describes searching for some "ego keywords" (keywords a person wants to rank well for) and finding himself ranking well. But then he noticed that Google personalized search was turned on.

The results within personalized search, no matter which search engine, are tailored to your liking. So if you want to be number one for "seo" you can be over time. Especially if you use the remove result function until your site is #1 and also if you tend to click on your pages more often than others.

A past related article that may be of interest is named Search Engine Optimization is Changing So Quickly.

Forum discussion at High Rankings Forum.

posted rustybrick in Search Technology at June 14, 2006 8:28 AM Comments (0)

Google PageRank Patent Updated

The Google PageRank Patent was updated the other day. The patent is titled Method for node ranking in a linked database. The abstract reads:

A method assigns importance ranks to nodes in a linked database, such as any database of documents containing citations, the world wide web or any other hypermedia database. The rank assigned to a document is calculated from the ranks of documents citing it. In addition, the rank of a document is calculated from a constant representing the probability that a browser through the database will randomly jump to the document. The method is particularly useful in enhancing the performance of search engine results for hypermedia databases, such as the world wide web, whose documents have a large variation in quality.
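For readers who want to see the idea in the abstract in action, here is a small Python sketch of the commonly published PageRank iteration: a page's rank comes from the ranks of the pages citing it, plus a constant for the random jump. It is not taken from the patent text. The damping value of 0.85, the iteration count, and the way pages with no outgoing links are handled are all assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated iteration.

    links maps each page to the list of pages it links to (cites).
    damping=0.85 is an assumed value; 1 - damping is the random-jump share.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly (an assumption).
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

With `links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, page "c" ends up ranked above "b" because it is cited by both "a" and "b".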

So what was changed? Bill Slawski says four things, but nothing substantial to the algorithm itself.

1. The references section was updated in this patent to include documents that are listed in the filing granted on September 28, 2004.

2. The abstract section remains the same in the new document, but the claims section was reduced in length, and appears to now cover aspects of both previous patent filings.

3. There are some minor looking changes in the “Detailed Descriptions” section between the version granted in 2001, and the one granted today.

4. The main changes appear in the summary section of the document. In the two previous documents, there were many passages that were repeated, but there were also differences. I’ve copied the areas of that section below where the three differ:

Continue reading the differences at SEO By the Sea.

Also Forum discussion at WebmasterWorld.

posted rustybrick in Google Optimization at June 8, 2006 7:32 AM Comments (0)

One Size Fits All: Optimize for Google to Optimize For Yahoo & MSN?

Over the past 6 months, ever since MSN really entered the search space, SEOs have begun talking about optimizing differently for each search engine algorithm. Back in the old days, there used to be a handful of search engines that people had to worry about. Then there was just Google, really. But now it is more diverse - we have four search algorithms to worry about. We have Google, we have Yahoo, we have MSN and we have Ask.com, which is increasing its share each day.

ProjectPHP, Cre8asite Forum administrator, started a thread named Do Search Engines Use One Algorithm For All Results? The obvious answer to me is no, they all use different engines. But what you typically find these days is that people have one optimization strategy for MSN and Yahoo and then another optimization strategy for Google. So how do you work that? You can cloak - oh no you can't! You can build different sites for different engines. You can work different pages or subdirectories of your site for different engines. Or you can pay them all off. :)

But seriously, the thread gets into the heads of what is going on today, in the trenches of SEO. If I tweak for engine X, will that hurt my rankings on engine Z? Do you believe all engines share the same algorithm? I don't. Do you?

Forum discussion at Cre8asite Forums.

posted rustybrick in Search Engine Optimization at May 19, 2006 7:37 AM Comments (0)

Yahoo! Fighting Web Spam: TrustRank & Link Spam Patent Application

To clarify before even beginning, Yahoo! does not necessarily use these techniques; they are just patent applications filed by Yahoo!

Bill Slawski posted an outstanding blog entry at SEW Blog named In Yahoo We Trust - The Link Spam Patent Application, which discusses one of Yahoo!'s papers and a patent application on fighting Web search spam.

(1) Combating Web Spam with TrustRank which discusses how non spam pages link to non spam pages, as Bill describes in short.

(2) Link-based spam detection which describes, similar to PageRank, the ability to "manually identifying reputable seed pages" and "separating reputable pages from spam pages."
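To illustrate the seed idea in both items, here is a minimal Python sketch of trust spreading outward from hand-picked reputable pages. It is a PageRank-style iteration in which the random jump always lands on a seed page. This is only a sketch under assumptions and not Yahoo!'s implementation: the function name, the 0.85 damping value and the sample page names are invented for the example.

```python
def trust_scores(links, seeds, damping=0.85, iterations=50):
    """Spread trust from manually chosen seed pages along outgoing links.

    links maps each page to the list of pages it links to.
    seeds is a non-empty set of pages judged reputable by hand.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets} | set(seeds)
    # All "jump" probability goes to the seeds instead of to every page.
    seed_weight = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(seed_weight)
    for _ in range(iterations):
        new_trust = {n: (1.0 - damping) * seed_weight[n] for n in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * trust[source] / len(targets)
                for target in targets:
                    new_trust[target] += share
        trust = new_trust
    return trust
```

Because good pages rarely link to spam, pages that no trusted page reaches stay at zero in this toy version, no matter how many links they point at reputable sites.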

Forum discussion on these topics at Cre8asite Forums & Search Engine Watch Forums.

posted rustybrick in Search Technology at May 5, 2006 7:54 AM Comments (1)

The Science Behind Google's Algorithms by Princeton University

Philipp Lenssen covered a new book named Google's PageRank and Beyond : The Science of Search Engine Rankings by Amy N. Langville and Carl D. Meyer from Princeton University Press. Reportedly, the book is incredibly mathematical, scientific and technical.

Jon Kleinberg of Cornell University (some of you may know the name) gave the following review:

Comprehensive and engagingly written. This book should become an important resource for many audiences: applied mathematicians, search industry professionals, and anyone who wants to learn more about how search engines work.

Interested in a copy? Discuss it with others at DigitalPoint Forums.

posted rustybrick in Search Technology at May 3, 2006 7:56 AM Comments (1)

Google's Three Wireless Advertising Patent Applications

As Bill Slawski accurately notes in a thread he created at Cre8asite Forums named Google's 3 Wireless Advertising Patent Applications, branding, advertising, and subsidizing, discussion on wireless advertising patents has been the craze recently. News.com reported on it; first they pushed out a title that said Google had won the rights to the patent, but that was quickly corrected. But I much prefer to read the Cre8asite Forum thread.

Bill summarizes the three patent applications.

(1) Method and system to provide wireless access at a reduced rate:

Methods and system for providing wireless access at a reduced rate. In one embodiment, access to a WAP is provided to an end-user at a rate subsidized by a first entity. The first entity includes advertisements in an end-user view.

Bill explains that this one has more to do with wireless access, which reminds me of a Gary post named Google Awarded Patent To Make Data Move Faster to Wireless Phones and Devices.

(2) Method and system to provide advertisements based on wireless access points:

Methods and system to provide advertisements in a view of an end user accessing a wireless access point. The advertisements are related to the WAP based on a predetermined criterion.

This one basically discusses the "integration" of the wireless ads into wireless enabled devices. I believe there are some geo specific ads discussed here as well.

(3) Method and system for dynamically modifying the appearance of browser screens on a client device:

In one embodiment, a connection of a client device to a wireless access point is identified. Further, the appearance of a screen presented on the client device is modified to reflect the brand associated with a provider of the wireless access point.

This is basically about branding the ads with the WAP partner's logo and content.

So in short you have three patent applications from Google. The first is about providing wireless access at a rate subsidized by ads. The second is about the integration of the ads and the third is about branding those ads.

Forum discussion at Cre8asite Forums.

posted rustybrick in Google Optimization at March 27, 2006 3:04 PM Comments (3)

GoogleBot Goes Wireless - Google Mobile Transcoding

It looks like we have a new GoogleBot that we need to worry about. The Google spider has the User Agent "Nokia6820/2.0 (4.83) Profile/MIDP-1.0 Configuration/CLDC-1.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)" and comes from a Google IP address. The bot is not incredibly new, but it is picking up speed. Many Webmasters are noticing this new creature exploring their sites. There is a whole section on the Google Remove URL page under the Remove transcoded pages anchor. This transcoding translates the page and strips out some HTML, which is upsetting to some. Mobile is important to Google; they have even added mobile queries to Google Sitemaps recently.
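If you want to spot this bot in your own logs, a simple check on the User Agent string is a reasonable first pass. Here is a minimal Python sketch. The function name and pattern are my own, and keep in mind that anyone can fake a User Agent, so a match here is only a hint, not proof that the visit came from a Google IP address.

```python
import re

# Matches the "Googlebot-Mobile/2.1" token seen in the User Agent string above.
GOOGLEBOT_MOBILE = re.compile(r"Googlebot-Mobile/(\d+(?:\.\d+)*)")


def googlebot_mobile_version(user_agent):
    """Return the Googlebot-Mobile version string, or None if it is not present."""
    match = GOOGLEBOT_MOBILE.search(user_agent or "")
    return match.group(1) if match else None
```

Feeding it the User Agent quoted above gives back the version token, while an ordinary browser string gives back None.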

Forum discussion at Search Engine Roundtable Forums, WebmasterWorld and DigitalPoint Forums.

posted rustybrick in Google Optimization at March 20, 2006 8:05 AM Comments (0)

Search Advertising Patent Applications

Bill Slawski at Cre8asite forums started a thread he named Google advertising patent applications, have fun with a list of several fairly recent patent applications on search advertising technology. Here is a quick summary of them:


  1. Adjusting ad costs using document performance or document collection performance
  2. Advertisements for devices with call functionality, such as mobile phones
  3. Facilitating the serving of ads having different treatments and/or characteristics, such as text ads and image ads
  4. Automated graphical advertisement size compatibility and link insertion
  5. System and method for rating electronic documents
  6. Results based personalization of advertisements in a search engine
  7. Rendering content-targeted ads with e-mail
  8. Selectively delivering advertisements based at least in part on trademark issues
  9. System and method for providing on-line user-assisted Web-based advertising
  10. Method and system for providing targeted graphical advertisements
  11. Generating user information for use in targeted advertising
  12. Systems and methods detecting for providing advertisements in a communications network
  13. System and method for enabling an advertisement to follow the user to additional web pages
  14. System and method for automatically targeting web-based advertisements
  15. Generating information for online advertisements from Internet data and traditional media data
  16. Promoting and/or demoting an advertisement from an advertising spot of one type to an advertising spot of another type
  17. Serving advertisements using a search of advertiser Web information
  18. Rendering advertisements with documents having one or more topics using user topic interest information
  19. Using enhanced ad features to increase competition in online advertising
  20. Method and system for dynamic textual ad distribution via email
  21. Serving content-relevant advertisements with client-side device support
  22. Serving advertisements based on content
  23. Methods and apparatus for serving relevant advertisements
  24. Method and system for providing advertising listing variance in distribution feeds over the internet to maximize revenue to the advertising distributor
  25. Method and system for providing filtered and/or masked advertisements over the internet
  26. Method and system for providing advertising through content specific nodes over the internet

Um, wow, ummm, now that is one huge list. Forum discussion at Cre8asite Forums.

posted rustybrick in Search Technology at March 17, 2006 8:44 AM Comments (0)

The Ultimate Forum Thread for Search Papers and Patents

I hope the member who posted the thread titled List of papers and patents?, Got one, guv? knew what he was getting himself into. All you need is Bill to see that and serve up one of the most comprehensive lists of papers in a forum thread ever.

There is not much I can say, I am still a bit in shock from the list.

Forum discussion at Cre8asite Forums.

posted rustybrick in Search Technology at February 16, 2006 7:53 AM Comments (1)

Search Engines Find a Way to Gauge and Confirm Trust

It started with an interview I found on "Trustmarks", in which Paul Walsh, the co-founder and CEO of Segala M Test, was talking about a way to enhance personalized search by including a trust rating.

Perhaps you've heard of ICRA (Internet Content Rating Association) descriptors for child protection. I've had this code on one of my sites for years, since I wanted parents to trust it. (And being one, I care about that sort of thing.) Walsh was interviewed about the Segala trustmark scheme and their working with the World Wide Web Consortium (W3C). Segala is a founding sponsor of the Mobile Web Initiative (MWI) responsible for creating best practices and guidelines for the future Web on small screens such as PDAs and mobile phones. Paul Walsh is also a committee member of the Web Accessibility Initiative.

I was very curious about how search engines, or if search engines, would implement "Trustmarks". We started a thread at Cre8asiteforums about it, featuring the interview, and several members were most interested. Some wondered how this trust is tracked. What would stop anyone from being registered, getting a trustmark and then changing their content?

Paul Walsh dropped by when the thread first began to tell us more.

Continue reading "Search Engines Find a Way to Gauge and Confirm Trust"

posted cre8pc in Search Technology at January 27, 2006 3:42 PM Comments (1)

The Invisible Spider: Covert Crawler

A thread over at Cre8asite forums named New kind of spider is in town links to a Wired article named Covert Crawler Descends on Web. In short, this article describes a new kind of spider designed to crawl the Web as human-like as possible.

How does it work?

The program comes from different internet addresses, simulates different browsers and throttles itself to human-like speeds... Hoffman's program downloads everything that comes with a page -- images, JavaScript and components like ActiveX and Flash -- instead of just hitting the page itself like traditional spiders do. It also simulates a full web browser, keeping a cache and requesting only new material... To select which links to click on, Hoffman has settled on a solution somewhere between a masterful AI and completely random selection. "In some ways it's a very simplified Turing test -- you can assign the different threads a personality. This crawler, you're the slow reader, you read the entire page." Another thread may spend less time on a page before it starts clicking on different links. "Each individual crawler has its own browser habits," he added.

Barry Welford calls this spider "somewhat scary," and I agree. Ron Carnell has it right: "any robot that doesn't ask for and then follow robots.txt is, by definition, unethical." Ron then gives you a technique you can use to track and then block this type of bot.
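The post does not reproduce Ron's technique, so the following is only a sketch of one widely used approach and not necessarily his: list a trap URL under Disallow in robots.txt, link to it where humans won't click, and treat any client that requests it as a robot that ignored the rules. The trap path, the function name and the sample requests are all hypothetical.

```python
# Hypothetical trap path; it would be listed under "Disallow" in robots.txt
# and linked from pages in a way human visitors never follow.
TRAP_PATH = "/trap/"


def find_offenders(requests, trap_path=TRAP_PATH):
    """Return the set of client IPs that requested a URL under the trap path.

    `requests` is an iterable of (ip, path) pairs, e.g. parsed from an
    access log. Parsing the log itself is left out of this sketch.
    """
    return {ip for ip, path in requests if path.startswith(trap_path)}


# Made-up sample data using documentation-only IP ranges.
sample_requests = [
    ("192.0.2.10", "/robots.txt"),
    ("192.0.2.10", "/index.html"),
    ("198.51.100.7", "/index.html"),
    ("198.51.100.7", "/trap/page1.html"),
]

offenders = find_offenders(sample_requests)
print(offenders)
```

Since the crawler described above comes from different internet addresses, a block list built this way would only catch the addresses that happened to hit the trap.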

Forum discussion at Cre8asite Forums.

posted rustybrick in Search Technology at January 17, 2006 8:52 AM Comments (0)

Researching Search Engine Results and How People Use Them for Research

I'm meddling in Bill Slawski's territory here, in that he's better known for presenting and analyzing papers on search engine technology. However, I caught this one and since it uses usability testing scenarios in the research, I gave it a shot.

The paper is Using meaningful and stable categories to support exploratory web search: Two formative studies by Bill Kules and Ben Shneiderman, of the Department of Computer Science, Human-Computer Interaction Laboratory and Institute for Advanced Computer Studies, University of Maryland.

The purpose of the study is to better understand how people use search engines to research topics - specifically, how categorization of search results applies to the end user experience.

"Categorizing web search results into comprehensible visual displays using meaningful and stable classifications can support user exploration, understanding, and discovery. We report on two formative studies in the domain of U.S. government web search that investigated how searchers use categorized overviews of search results for complex, exploratory search tasks."

They ran test subjects through a variety of tasks. Here is one example.

"Scenario 2 (Breast cancer) - You are a 30-year old journalist writing an article on breast cancer and what the federal government is doing about it. You are exploring the topic, starting by looking on the Web to find out what kind of information is available. You have just entered the search terms 'breast cancer'."

Continue reading "Researching Search Engine Results and How People Use Them for Research"

posted cre8pc in Search Technology at January 3, 2006 1:44 PM Comments (0)

Craigslist Blocks Most Spiders: Millions of Pages Delisted

A thread started at our forums named Craigslist Delists Millions of Pages from Search Engine Indexes uncovers the new robots.txt file in place over at Craigslist. It basically reads:

##############################
# Exclude robots from these

User-agent: YahooFeedSeeker
Disallow: /forums
Disallow: /res/
Disallow: /post
Disallow: /email.friend
Disallow: /?flagCode
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj

User-agent: *
Disallow: /cgi-bin
Disallow: /cgi-secure
Disallow: /forums
Disallow: /search
Disallow: /res/
Disallow: /post
Disallow: /email.friend
Disallow: /?flagCode
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj


#####################################

They supposedly had 3.6 million pages indexed at Google and millions more at the other search engines. Now? 211,000 at Google, 280,000 at Yahoo and 4,695 at MSN.
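To see how a rule set like this plays out for a crawler, here is a minimal sketch using Python's standard urllib.robotparser. It loads only a subset of the rules quoted above, and the bot name ExampleBot and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules quoted above, kept short for the example.
ROBOTS_TXT = """\
User-agent: YahooFeedSeeker
Disallow: /forums
Disallow: /sss

User-agent: *
Disallow: /cgi-bin
Disallow: /forums
Disallow: /search
Disallow: /sss
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and the paths below are hypothetical.
for agent, path in [
    ("ExampleBot", "/sss/123.html"),
    ("ExampleBot", "/search"),
    ("ExampleBot", "/about/"),
    ("YahooFeedSeeker", "/search"),
]:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(agent, path, verdict)
```

Disallow rules match by path prefix, so a single short entry such as /sss covers every page beneath it, which is how a dozen lines can take millions of pages out of an index. A bot with its own User-agent group follows that group and not the * group, which is why /search is not blocked for YahooFeedSeeker in this subset.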

Forum discussion at Search Engine Roundtable Forums.

posted rustybrick in Search Technology at January 3, 2006 8:10 AM Comments (0)

Google Analytics (ex-Urchin) Delivers Web Analytics for FREE

Google has now re-branded Urchin as Google Analytics, presenting users with better ways to “understand and influence visitor behavior and generate a higher ROI on marketing initiatives”. Yes folks! It’s offering a free hosted web analytics service, in hopes that advertisers, publishers and website owners will spend time understanding how people find their websites, navigate through them and convert on the goals of the site. With the free service, Google hopes people will spend money on their search marketing campaigns rather than on measurement. This is going to have a huge impact on both the search marketing and the web analytics industries. Draw your own conclusions.

But how much is really free? Google Analytics will allow you to track up to 5 million pageviews per month, no questions asked, no fees charged. If you have a BIG MONSTER website, all they request is that you have at least one active AdWords account with an active campaign; spend $1 if you want, that’s all it takes. No more pageview caps. I’m sure they hope you spend much more than that when you see all the tracking benefits.

What’s more, Google Analytics now integrates with AdWords to better monitor “ROI metrics automatically without having to import cost data or tag keywords”, and it tracks all of your other internet marketing initiatives as well. When you subscribe to it, you will see it as a new tab under your AdWords account. It now has executive, marketer, and webmaster dashboards for viewing quick summaries of “traffic, e-commerce, and conversion trends without hunting through reports.” Here is what else it offers:


  • Reporting interface accessible directly from the google.com/analytics website if you don’t have an AdWords account

  • Advanced visitor segmentation with over 80 web analytics reports

  • Ability to track up to 50 websites within your account

  • Site overlay

  • Funnel visualization

  • GeoTargeting with a cool map that shows where your traffic comes from
  • It’s available in 16 languages: Chinese (Simplified), Chinese (Traditional), Danish, Dutch, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Portuguese, Russian, Spanish, Swedish and English.

  • And much more…


For those worried about privacy, this is what they say: “Google takes the trust people place in us very seriously, and we are committed to safeguarding the privacy of your data. We understand that web analytics data is sensitive, so we accord it the ironclad protection it deserves. Google Analytics is subject to the same industry leading privacy policy as all Google services: http://www.google.com/privacypolicy.html”

On a personal note, I’m also very excited about the steps Google is making because my consulting firm, iHispanic Marketing Group, is proud to announce that Google Analytics has chosen us as one of its Client Service and Support Consultants to service the global Hispanic market. With this strategic alliance we are committed to delivering professional services for training, advanced support, and expert web analytics consulting to executives, marketing managers and webmasters in both Spanish and English. Our loyalty to Urchin and to our clients has brought great rewards. Google Analytics will be a fun ride moving forward as we continue building leadership in the Hispanic market for search engine marketing and internet strategy.

For discussion on this topic, you’re welcome to share your thoughts in the SearchEngineWatch Forum’s thread: Urchin Now Google Analytics, Now Free.

posted nacho in Tracking & Conversion Measurements at November 13, 2005 11:16 PM Comments (3)

Large Listing of Search Patent Applications, Not from Google

Gary Price of Search Engine Watch took a look at some non-Google patent applications from Yahoo, Microsoft and others today. They include HP, Microsoft, Yahoo!, and Overture (Yahoo-owned).

As posted by Gary:

Title: Method and system for identifying image relatedness using link and page layout analysis
Assignee: Microsoft

Title: Method and system for classifying display pages using summaries
Assignee: Microsoft

Title: Method and apparatus for performing a search
Assignee: Yahoo

Title: Method and system for ranking documents of a search result to improve diversity and information richness
Assignee: Microsoft

Title: Contextual flyout for search results
Assignee: IBM

Title: Method and apparatus for providing information
Assignee: Fujitsu

Title: Method and apparatus for identifying related searches in a database search system
Assignee: Overture/Yahoo

Title: Verifying relevance between keywords and Web site contents
Assignee: Microsoft

Title: Systems and methods that rank search results
Assignee: Microsoft

Title: Search systems and methods with integration of user annotations
Assignee: Yahoo

Title: Integration of instant messenging with Internet searching
Assignee: Yahoo

Title: Search system using user behavior data
Assignee: Microsoft

Forum discussion at Search Engine Watch Forums.

posted rustybrick in Search Technology at November 10, 2005 11:06 AM Comments (1)

Link Spam Detection Research Paper

Last night Gary blogged on A New Report on Estimating Link Spam. Gary describes it as a "21 page (pdf) technical research paper from the Stanford InfoLab that takes a look at link spam." The paper was written by two people at Yahoo and two at Stanford: Zoltan Gyongyi (Stanford), Pavel Berkhin (Yahoo), Hector Garcia-Molina (Stanford), and Jan Pedersen (Yahoo).

Read Link Spam Detection Based on Mass Estimation if you dare. :)

Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page's ranking. We discuss how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.
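The abstract names spam mass without spelling out the calculation, so here is a rough sketch under one assumed reading of the idea, not the authors' implementation: score every page with an ordinary PageRank-style iteration, score it again letting only a hand-picked set of trusted pages seed the score, and treat the share of the first score that the second cannot account for as the page's estimated relative spam mass. The toy graph, the trusted set and the function names are all hypothetical.

```python
def linear_pagerank(links, jump, damping=0.85, iterations=100):
    """Iterate p = damping * T^T p + (1 - damping) * jump, without renormalising."""
    scores = dict(jump)
    for _ in range(iterations):
        nxt = {node: (1 - damping) * jump[node] for node in links}
        for src, outs in links.items():
            if not outs:
                continue  # pages without out-links pass nothing on
            share = damping * scores[src] / len(outs)
            for dst in outs:
                nxt[dst] += share
        scores = nxt
    return scores


def relative_spam_mass(links, good_core, damping=0.85):
    """Fraction of each page's score that is not backed by the trusted core."""
    n = len(links)
    uniform = {node: 1.0 / n for node in links}
    core_only = {node: (1.0 / n if node in good_core else 0.0) for node in links}
    p = linear_pagerank(links, uniform, damping)
    p_core = linear_pagerank(links, core_only, damping)
    return {node: (p[node] - p_core[node]) / p[node] for node in links}


# Toy graph: two trusted pages, one honest page they link to, and a
# "boosted" page propped up only by two throwaway spam pages.
links = {
    "good_a": ["good_b"],
    "good_b": ["good_a", "honest"],
    "honest": ["good_a"],
    "spam1": ["boosted"],
    "spam2": ["boosted"],
    "boosted": [],
}
mass = relative_spam_mass(links, good_core={"good_a", "good_b"})
for node in sorted(mass, key=mass.get, reverse=True):
    print(node, round(mass[node], 3))
```

In this toy graph the boosted page receives nothing from the trusted core, so all of its score counts as spam mass, while the honest page scores lower because part of its ranking traces back to trusted pages. The estimate is only as good as the trusted set: any page the core does not reach looks suspicious, including the throwaway pages themselves.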

Forum discussion soon to be at this thread at Search Engine Watch Forums.

posted rustybrick in Search Technology at November 9, 2005 8:06 AM