The Invisible Spider: Covert Crawler

Jan 17, 2006 • 8:52 am | Filed Under Search Technology
 

A thread at Cre8asite Forums titled "New kind of spider is in town" links to a Wired article titled "Covert Crawler Descends on Web." In short, the article describes a new kind of spider designed to crawl the Web in as human-like a way as possible.

How does it work?

The program comes from different internet addresses, simulates different browsers and throttles itself to human-like speeds... Hoffman's program downloads everything that comes with a page -- images, JavaScript and components like ActiveX and Flash -- instead of just hitting the page itself like traditional spiders do. It also simulates a full web browser, keeping a cache and requesting only new material... To select which links to click on, Hoffman has settled on a solution somewhere between a masterful AI and completely random selection. "In some ways it's a very simplified Turing test -- you can assign the different threads a personality. This crawler, you're the slow reader, you read the entire page." Another thread may spend less time on a page before it starts clicking on different links. "Each individual crawler has its own browser habits," he added.
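To make the "personality" idea in the excerpt concrete, here is a minimal sketch of how a per-thread browsing profile might decide how long to stay on a page and which link to follow next. It is not Hoffman's code; the article gives no implementation details. The names Personality, dwell_seconds and choose_link, the reading speeds, and the weighted-random link choice are all assumptions made for illustration. The sketch does no networking at all; it only models the decisions.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Personality:
    """Hypothetical browsing profile for one crawler thread (illustrative only)."""
    name: str
    words_per_minute: float  # assumed reading speed
    read_fraction: float     # assumed share of the page this reader gets through (0..1)


def dwell_seconds(p: Personality, word_count: int) -> float:
    """Seconds this personality would stay on a page before clicking on."""
    words_read = word_count * p.read_fraction
    return words_read / p.words_per_minute * 60.0


def choose_link(links: Sequence[str], weights: Sequence[float],
                rng: random.Random) -> str:
    """Weighted random pick: a middle ground between uniform chance and real AI."""
    return rng.choices(links, weights=weights, k=1)[0]


# Two assumed profiles, echoing the "slow reader" mentioned in the quote.
slow_reader = Personality("slow reader", words_per_minute=200, read_fraction=1.0)
skimmer = Personality("skimmer", words_per_minute=400, read_fraction=0.25)
```

On a 600-word page, the slow reader in this sketch would wait 180 seconds and the skimmer 22.5 seconds before following a link. How the weights are set (link position, anchor text, and so on) is left open here, since the article does not say.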

Barry Welford calls this spider "somewhat scary," and I agree. Ron Carnell has it right: "any robot that doesn't ask for and then follow robots.txt is, by definition, unethical." Ron also offers a technique you can use to track and then block this type of bot.
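Ron's technique itself is in the forum thread and is not reproduced here. As a rough illustration of the general idea of tracking robots.txt behavior, the sketch below scans already-parsed access-log records and lists clients that made many requests without ever asking for /robots.txt. The function name clients_skipping_robots, the (client, path) record shape, and the min_requests threshold are assumptions for this example. Ordinary browsers never request robots.txt either, so the result is only a list of candidates to review, not proof of a bot, and a crawler that rotates addresses would be spread across several client IDs.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def clients_skipping_robots(requests: Iterable[Tuple[str, str]],
                            min_requests: int = 20) -> Set[str]:
    """Return client IDs with at least min_requests hits and no /robots.txt fetch.

    Each record is an assumed (client_id, request_path) pair, for example
    taken from a parsed web server access log.
    """
    hits = defaultdict(int)
    asked = set()
    for client, path in requests:
        hits[client] += 1
        # Ignore any query string when checking for the robots.txt request.
        if path.split("?", 1)[0] == "/robots.txt":
            asked.add(client)
    return {c for c, n in hits.items() if n >= min_requests and c not in asked}


# Example records using documentation-range addresses (made up for this sketch).
log = [("192.0.2.1", "/robots.txt")]
log += [("192.0.2.1", "/page%d" % i) for i in range(30)]  # polite, heavy client
log += [("192.0.2.2", "/page%d" % i) for i in range(30)]  # heavy, never asked
log += [("192.0.2.3", "/")]                               # light visitor
suspects = clients_skipping_robots(log)
```

Here suspects contains only 192.0.2.2: the first client asked for robots.txt, and the third made too few requests to matter. Whether to block a flagged client is a separate decision that should take into account what it actually requested.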

Forum discussion at Cre8asite Forums.
