Spider and DOS Defense - Rebels, Renegades, and Rogues

Nov 16, 2006 • 3:12 pm | Filed Under WebmasterWorld 2006 Las Vegas
 

Vanessa Fox from Google talked about bots. The basics: the major search engines' bots behave well, but they do not all use the same syntax. Test your robots.txt file to make sure you're blocking and allowing what you want, and use webmaster tools to diagnose any problems with Googlebot. The standard is documented at robotstxt.org, with more info at google.com/webmasters/. Google has good tools to help you out at Sitemaps. Your options include robots.txt, robots meta tags, the nofollow tag, password-protected files, the URL removal tool and a Sitemap XML file. She then went through some of the tools out there. Sorry for lack of coverage here, something came up...
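The advice to test your file can be tried locally. Here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt rules, the "ExampleScraper" agent and the paths are all made up for illustration, and each engine's own parser may read a file differently (as noted above, they do not all use the same syntax), so treat this as a sanity check rather than a replacement for the engines' own tools.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow: /private/
"""


def check_rules(robots_txt, checks):
    """Return {(agent, path): allowed} for each (agent, path) pair in checks."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(agent, path): parser.can_fetch(agent, path) for agent, path in checks}


# Pairs we expect to be allowed or blocked by the rules above.
CHECKS = [
    ("Googlebot", "/index.html"),
    ("Googlebot", "/private/report.html"),
    ("ExampleScraper", "/index.html"),
]

results = check_rules(ROBOTS_TXT, CHECKS)
for (agent, path), allowed in results.items():
    print(f"{agent:15} {path:25} {'allowed' if allowed else 'blocked'}")
```

Swap in your real file and the agent and path pairs you care about, then compare the output with what you intended to block and allow.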

Dan Kramer was next up, on whitelisting, cloaking and metrics on bots. Use your analytics to track your bot activity. He writes a custom script that logs bot activity, including whether each request carries an HTTP referer header. You can selectively log certain requests based on footprints, and have the script email you reports. Even then there is manual work you need to do. Some people spoof good bots, so you need to run reverse DNS lookups on the requests and check whois info. Typically the user agent string includes a URL with more info. His bot detection strategy: search engine bots almost never send an HTTP referer header, they use recognizable user agents, and they come from known IP ranges, with IPs normally registered to the company. Resources: IPLists.com, WMW has a good forum on it, fantomaster.com/fasvsspy01.html, jafsoft.com and others. What do you do with all this data? Bot control with robots.txt, and mod_rewrite for handling, cloaking and banning bad bots. He explained what robots.txt is, then gave some suggestions for these bots.
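The reverse DNS check for spoofed bots can be sketched roughly as follows. This is not Dan's script, just one common way to do it: look up the hostname for the requesting IP, check that it ends in a domain you trust, then resolve that hostname forward and confirm it points back to the same IP. The trusted suffixes and the function names are assumptions for illustration; check each engine's own documentation for the hostnames its crawlers really use.

```python
import socket

# Assumed hostname suffixes for illustration only; confirm them against
# each search engine's own documentation before relying on them.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host):
    """Return the list of IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_bot(ip, suffixes=TRUSTED_SUFFIXES,
                    reverse=reverse_lookup, forward=forward_lookup):
    """Reverse DNS, suffix check, then forward DNS back to the same IP."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(suffixes):
            return False
        return ip in forward(host)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False


def looks_spoofed(user_agent, ip, bot_token="googlebot", **lookup_kwargs):
    """True when the user agent claims to be the bot but DNS does not agree."""
    claims_bot = bot_token in user_agent.lower()
    return claims_bot and not is_verified_bot(ip, **lookup_kwargs)
```

The lookup functions are passed in as parameters so you can test the logic without hitting the network, and so results can be cached in front of them. DNS lookups are slow, so in practice you would run this on logged requests or cache the answers rather than check every hit live.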

William Atchison from Crawl Wall was next up. Check out my past coverage of his presentation at SES San Jose at http://www.seroundtable.com/archives/004336.html. I love his presentation.

Brett Tabke was up next. WMW was knocked offline four times last year because of robots, and there were duplicate content issues. They estimate that bots generated 1 billion page views in 2005. WMW is a big target, so they get hit. He showed off the rogue access control system backend module for WMW, pretty cool. They require logins by agent, domain, IP and referrer. They have a banned list on the same criteria, and they identify the search engines by domain, IP and agent name. They also have whitelists by the same criteria. The results: a 90% reduction in rogue spider activity, 5 TB savings in bandwidth and average page generation time reduced by 50%.
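To give a feel for how whitelists, banned lists and forced logins on those criteria might fit together, here is a rough sketch. It is not WMW's module, which was only shown on screen. The Rule class, the rule values, the order of checks (whitelist first, then banned list, then login) and the example IP ranges are all assumptions made for illustration, and domain matching is left out because it needs DNS lookups.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: every field that is set must match the request."""
    agent: str = ""     # lowercase substring of the user agent
    network: str = ""   # CIDR range, e.g. "192.0.2.0/24"
    referrer: str = ""  # lowercase substring of the referrer

    def matches(self, agent, ip, referrer):
        if self.agent and self.agent not in agent.lower():
            return False
        if self.network:
            if ipaddress.ip_address(ip) not in ipaddress.ip_network(self.network):
                return False
        if self.referrer and self.referrer not in referrer.lower():
            return False
        return True


# Made-up lists using documentation-only IP ranges and invented names.
WHITELIST = [Rule(agent="googlebot", network="192.0.2.0/24")]
BANNED = [
    Rule(agent="examplescraper"),
    Rule(network="198.51.100.0/24"),
    Rule(referrer="spam.example"),
]


def decide(agent, ip, referrer, whitelist=WHITELIST, banned=BANNED):
    """Return 'allow', 'deny' or 'require_login' for one request."""
    if any(rule.matches(agent, ip, referrer) for rule in whitelist):
        return "allow"
    if any(rule.matches(agent, ip, referrer) for rule in banned):
        return "deny"
    # Assumed default: anything unrecognized has to log in.
    return "require_login"


print(decide("Googlebot/2.1", "192.0.2.15", ""))
print(decide("Googlebot/2.1", "203.0.113.5", ""))
print(decide("ExampleScraper/1.0", "203.0.113.5", ""))
```

The whitelist rule here needs both the agent and the IP range to match, so a request that only claims a search engine's agent name from an unknown range still falls through to the login step.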
