Spider and DOS Defense - Rebels, Renegades, and Rogues

Nov 16, 2006 - 3:12 pm
Filed Under PubCon 2006

Vanessa Fox from Google is up to talk about bots... The gist is that the major search engines' bots behave well, but they do not all use the same syntax, so test your file to make sure you're blocking and allowing what you want, and use webmaster tools to diagnose any problems with Googlebot. The standard is documented at robotstxt.org, with more info at google.com/webmasters/. Google has good tools to help you out in Sitemaps. Your options include robots.txt, robots meta tags, the nofollow attribute, password-protecting files, the URL removal tool and a sitemap XML file. She then went through some of the tools out there. Sorry for lack of coverage here, something came up...
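
To illustrate the kind of thing she says you should test before relying on it, here is a minimal robots.txt (the /private/ path is made up for the example) that blocks a directory for all crawlers while leaving Googlebot unrestricted, since Googlebot follows the more specific record:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: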

Dan Kramer was next up to talk about whitelisting, cloaking and metrics on bots. Use your analytics to track your bot activity. He uses custom scripting to log bot activity, recording whether requests include an HTTP referer header. You can selectively log certain requests based on footprints, and have the script email you reports. Even then there is manual work you need to do. Some people spoof good bots, so you need to do reverse DNS lookups on requests and check whois info (a sketch of that check is below). Typically the user agent string contains a URL with more info about the bot. His bot detection strategy: search engine bots almost never send an HTTP referer header, they use recognizable user agents, they come from known IP ranges, and their IPs are normally registered to the company. Resources include iplists.com, a good forum on WebmasterWorld, fantomaster.com/fasvsspy01.html, jafsoft.com and others. What do you do with all this data? Bot control with robots.txt, mod_rewrite for handling, cloaking and banning bad bots. He explains what robots.txt is... He then gives some suggestions for handling these bots.
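
A minimal sketch of the reverse-plus-forward DNS verification he is describing, assuming Python and Googlebot's documented googlebot.com / google.com hostnames (the sample IP is just illustrative):

import socket

def is_verified_googlebot(ip_address):
    """Verify a claimed Googlebot hit: reverse DNS to a Google hostname, then forward DNS back to the same IP."""
    try:
        hostname = socket.gethostbyaddr(ip_address)[0]
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        return socket.gethostbyname(hostname) == ip_address
    except OSError:
        # Lookup failed; treat the request as unverified
        return False

print(is_verified_googlebot("66.249.66.1"))  # example IP from a Google crawl range

A spoofed "Googlebot" user agent coming from an unrelated IP fails both halves of the check, which is why the reverse lookup alone isn't enough.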

William Atchison from Crawl Wall is next up. Check out my past coverage of his presentation at SES San Jose at https://www.seroundtable.com/archives/004336.html. I love his presentation.

Brett Tabke was up next... WebmasterWorld was knocked offline four times last year because of robots, and there were duplicate content issues. They estimate bots generated 1 billion page views in 2005. WMW is a big target, so they get hit. He shows off the rogue access control system backend module for WMW, pretty cool. They require logins based on agents, domains, IPs and referrers. They have a banned list using the same criteria, they identify the search engines by domains, IPs and agent names, and they also have whitelists using the same criteria. The results: a 90% reduction in rogue spider activity, 5 TB of bandwidth saved and average page generation time cut by 50%.
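
As a rough sketch of what that kind of rule matching looks like (hypothetical field names and values, not WebmasterWorld's actual module), assuming Python:

# Hypothetical rule lists; WMW's real system uses its own data and criteria.
BANLIST = {"agents": ["BadBot"], "hosts": [".scraper.example.net"], "ips": ["203.0.113."]}
WHITELIST = {"agents": ["Googlebot", "Slurp", "msnbot"], "hosts": [".googlebot.com"], "ips": ["66.249."]}

def matches(rules, user_agent, remote_host, remote_ip):
    return (any(a in user_agent for a in rules["agents"])
            or any(remote_host.endswith(h) for h in rules["hosts"])
            or any(remote_ip.startswith(p) for p in rules["ips"]))

def classify_request(user_agent, remote_host, remote_ip):
    """Decide how to treat a request: ban it, serve it, or require a login first."""
    if matches(BANLIST, user_agent, remote_host, remote_ip):
        return "ban"
    if matches(WHITELIST, user_agent, remote_host, remote_ip):
        return "allow"
    return "require-login"  # unrecognized crawler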
