Spider and DOS Defense - Rebels, Renegades, and Rogues

Nov 16, 2006 - 3:12 pm
Filed Under PubCon 2006

Vanessa Fox from Google is up to talk about bots... The gist: the major search engines' bots behave well, but they do not all use the same syntax, so test your file to make sure you're blocking and allowing what you want, and use webmaster tools to diagnose any problems with Googlebot. The standard is documented at robotstxt.org and at google.com/webmasters/. Google has good tools to help you out in Sitemaps. There are robots.txt, robots meta tags, the nofollow tag, password-protecting files, the URL removal tool and the Sitemap XML file. She then went through some of the tools out there. Sorry for lack of coverage here, something came up...
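To her "test your file" point, a quick way to check which URLs a robots.txt actually blocks or allows is Python's standard robotparser; the rules, user agents and URLs below are hypothetical examples I picked for illustration, not anything shown in the session:

```python
# Minimal sketch: test a robots.txt file before relying on it, using the
# standard-library parser. The rules and URLs here are made-up examples.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /search-results/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Check what each crawler may fetch.
checks = [
    ("Googlebot", "http://example.com/private/report.html"),
    ("Googlebot", "http://example.com/search-results/page2"),
    ("SomeOtherBot", "http://example.com/search-results/page2"),
]
for agent, url in checks:
    allowed = parser.can_fetch(agent, url)
    print(f"{agent} -> {url}: {'allowed' if allowed else 'blocked'}")
```

The point of the example is that a user-agent-specific group overrides the wildcard group entirely, which is easy to get wrong if you only eyeball the file.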

Dan Kramer was next up to talk about whitelisting, cloaking and metrics on bots. Use your analytics to track your bot activity. He does custom scripting to set up a script that logs bot activity. They log whether requests carry an HTTP referer header. You can selectively log certain requests based on footprints. Have the script email you reports. Even then, there is manual work you need to do. Some people spoof good bots, so then you need to reverse-DNS the requests and check whois info. Typically in the user agent string there is a URL with more info. Bot detection strategy: search engine bots almost never send an HTTP referer header, they use recognizable user agents, they come from known IP ranges, and their IPs are normally registered to the company. Resources: iplists.com, WMW has a good forum on it, fantomaster.com/fasvsspy01.html, jafsoft.com and others. What do you do with all this data? Bot control with robots.txt, mod_rewrite for handling, cloaking and banning bad bots. He explains what robots.txt is... He then gives some suggestions for these bots.
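For the spoofing point, a rough sketch of the reverse-DNS check he describes: resolve the client IP back to a hostname, confirm the hostname belongs to a known engine's domain, then forward-resolve it to make sure it maps back to the same IP. The domain suffixes and sample IP below are my own illustrative picks, not Dan's exact list:

```python
# Sketch of verifying a "good bot" claim by reverse DNS plus forward confirmation.
# The SEARCH_ENGINE_DOMAINS tuple and the sample IP are illustrative assumptions.
import socket

SEARCH_ENGINE_DOMAINS = (".googlebot.com", ".google.com",
                         ".search.msn.com", ".crawl.yahoo.net")

def is_real_search_bot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)      # reverse DNS lookup
    except OSError:
        return False
    if not hostname.endswith(SEARCH_ENGINE_DOMAINS):
        return False                                    # hostname not in a known engine's domain
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)  # forward-confirm the hostname
    except OSError:
        return False
    return ip in addresses                              # must resolve back to the same IP

if __name__ == "__main__":
    # A request claiming to be Googlebot: verify before whitelisting or cloaking for it.
    print(is_real_search_bot("66.249.66.1"))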

William Atchison from Crawl Wall is next up. Check out my past coverage of his presentation at SES San Jose at https://www.seroundtable.com/archives/004336.html. I love his presentation.

Brett Tabke was up next... WMW was knocked offline four times last year because of robots. There were duplicate content issues. They estimate about 1 billion page views by bots in 2005. WMW is a big target, so they get hit. He shows off the rogue access control system backend module for WMW, pretty cool. They require logins by agents, domains, IPs and referrers. They have a banned list based on the same criteria, and they identify the search engines by domains, IPs and agent names. They also have whitelists by the same criteria. The results: a 90% reduction in rogue spider activity, 5 TB in bandwidth savings, and average page generation time reduced by 50%.
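For a sense of how that kind of whitelist/banlist/require-login decision could be wired up, here is a minimal Python sketch; the Request shape and the list contents are made up for illustration and are not WMW's actual module:

```python
# Sketch of an allow / ban / require-login decision based on user agent,
# IP range and referrer. All lists and names here are hypothetical examples.
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

WHITELIST_AGENTS = ("googlebot", "slurp", "msnbot")          # identified search engines
BANNED_AGENTS = ("emailsiphon", "webcopier", "httrack")      # known rogue agents
BANNED_NETWORKS = [ip_network("198.51.100.0/24")]            # example rogue IP range
BANNED_REFERRERS = ("spam-example.com",)

@dataclass
class Request:
    user_agent: str
    ip: str
    referrer: str

def decide(req: Request) -> str:
    agent = req.user_agent.lower()
    if any(good in agent for good in WHITELIST_AGENTS):
        return "allow"            # whitelisted search engine bot
    if any(bad in agent for bad in BANNED_AGENTS):
        return "ban"              # banned by agent name
    if any(ip_address(req.ip) in net for net in BANNED_NETWORKS):
        return "ban"              # banned by IP range
    if any(bad in req.referrer for bad in BANNED_REFERRERS):
        return "ban"              # banned by referrer
    return "require_login"        # everyone else has to log in

print(decide(Request("WebCopier v4.0", "203.0.113.5", "")))  # -> ban
```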

 
