Bot Obedience Course

Dec 5, 2006 • 5:28 pm | comments (0) by twitter | Filed Under Search Engine Strategies 2006 Chicago
 

This session is moderated by Detlev Johnson, who is the Director of Consulting for Position Technologies.

Detlev starts off by explaining that the session will cover how to help the search robots index your site, and how to keep out the bots which you do not want using up your bandwidth.

Jon Glick from Become.com starts by explaining that bots don't analyse or rank a site, they simply go out and grab content. Using forms or pure Flash will halt the bots straight away, so use text links to get around such features. Robot traps are page features which cause infinite loops and duplicate content issues, such as calendars with text links which robots can keep clicking on month-by-month to infinity. You can get around these issues by removing session IDs, using forms/JavaScript for links you don't want followed, and using robots.txt. The robots.txt file allows you to tell spiders where to go and where not to go, including targeting each robot by name (e.g. Googlebot). Meta tags can also be used instead of (or in conjunction with) robots.txt to tell a spider what to do on a specific page, such as not to follow links or not to index the page.
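As a sketch of the mechanisms Jon describes (the paths and rules here are illustrative, not from the session):

```
# robots.txt -- tell spiders where not to go, per bot or for everyone
User-agent: Googlebot
Disallow: /calendar/    # keep Googlebot out of a hypothetical robot trap

User-agent: *
Disallow: /search/      # block all well-behaved bots from internal search
```

The per-page equivalent is a meta tag in the page's head:

```html
<meta name="robots" content="noindex, nofollow">
```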

Well-behaved bots obey robots.txt and meta tag commands, identify themselves with a unique name (they don't pretend to be someone else) and respect the intellectual property of content on your site. Bad bots ignore robots.txt and meta tags, and can sometimes hammer a server with page requests or scrape content from your site. You can deal with an abusive bot by blocking its IP address for 24 hours; if it continues to abuse the system, you should block it at firewall level permanently. An image verification challenge can also be used if you suspect a bad bot, although this could annoy false-positive visitors and should be made usable for the colour blind or the disabled. Be careful who you block though, as bots sometimes change their names (such as when Yahoo switched from Google's index to their own back-end). You should also be aware of lesser-known bot UserAgents such as the Yahoo Shopping spider and Google's variety of bots (details of which can be found on their websites). The benefits of controlling bots include better indexing and a lower risk of duplicate content issues.
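A minimal Apache sketch of the IP-banning idea (the address is a documentation example, and the exact directives depend on your server version):

```apache
# Apache 2.2-style directives: deny one abusive IP, allow everyone else
Order Allow,Deny
Allow from all
Deny from 192.0.2.15
```

A permanent block would normally live at the firewall rather than in the web server configuration.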

Dan Thies is next up to the stand. Starting on duplicate content, some of the issues include using www.domain.com without redirecting domain.com, pages which are almost the same, near-empty pages, etc. To find out whether a spider is genuine, you can reverse-lookup the IP address of the incoming spider and check what domain name it resolves to, e.g. bot1234.google.com. Dan uses the ARIN.net WHOIS service at his company as an alternative method, which checks who owns the IP address. On dynamic websites, you could automatically insert a "nofollow, noindex" meta tag on the entry page, although be careful that this does not block good bots. You can check for duplicate content via Google Alerts and Copyscape. You can kill duplicate content by sending DMCA (Digital Millennium Copyright Act) notices to web hosts, site owners and the search engines. Why are any URLs from known proxies and scrapers still indexed by the major engines?
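The reverse-lookup check Dan describes can be sketched in Python. The trusted-suffix list below is an assumption for illustration (based on domains mentioned during the session), not an official list:

```python
import socket

# Illustrative list of trusted crawler domains -- an assumption for this
# sketch, not an official list from any engine.
TRUSTED_SUFFIXES = (".google.com", ".googlebot.com", ".yahoo.com",
                    ".inktomisearch.com", ".search.msn.com")

def hostname_is_trusted(hostname, suffixes=TRUSTED_SUFFIXES):
    """True if the reverse-DNS hostname ends with a trusted crawler domain."""
    return hostname.lower().endswith(suffixes)

def verify_spider_ip(ip):
    """Reverse-resolve the IP, check the hostname, then forward-resolve the
    hostname and confirm it maps back to the same IP (a forged PTR record
    fails this second check). Requires network access."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
        if not hostname_is_trusted(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except socket.error:
        return False
```

The forward-confirm step matters because anyone who controls reverse DNS for their own IP range can make it resolve to something like bot1234.google.com; only the real owner of google.com can make the name resolve back to that IP.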

The next speaker is Bill Atchison of CrawlWall. His website was under constant bot attacks, and scraping accounted for 10% of his page impressions, not counting Google, Yahoo and MSN. Copyrighted material was scraped, stolen and used on spam websites, so he decided to build a system to stop them. Bad bots want to make money out of your content; they are effectively hackers. Some of the bots are used to check for copyright infringement or for intelligence gathering, and some of the data is even sold to the US Government. Bill keeps bots under control by using opt-in rather than opt-out in robots.txt: allow Googlebot, Slurp, MSNBot and Teoma, then block everything else (unless you know of niche engines which you'd like to be included in). Using a customised firewall, you can also let through only IE, Firefox, Safari and Opera and block other unknown browsers. He uses image verification on some bots to check if they really are humans. Don't allow search engines to cache your content (use the noarchive meta tag), as spammers do scrape the search engine cache to gain content. He then goes on to list the various checks and tests which his software, CrawlWall, uses to analyse robots.
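Bill's opt-in policy might look like this in robots.txt (bot names as given in the session; a real file should also list any niche engines you want in):

```
# Opt-in: the four majors may crawl everything...
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: MSNBot
Disallow:

User-agent: Teoma
Disallow:

# ...everyone else is blocked
User-agent: *
Disallow: /
```

The no-caching rule he mentions is a per-page meta tag: `<meta name="robots" content="noarchive">`.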

Tim Converse from Yahoo steps up to talk about the robot Slurp. Although it's not a bad bot, sometimes Slurp gets a little excited, so Yahoo supports a new directive in robots.txt which controls the crawl rate. He noted that although he agrees with the opt-in principle, users should make sure that all the major engines they'd like to be indexed by are included in the policy file. Yahoo supports wildcards and pattern-based commands in robots.txt, which can exclude session IDs etc.; information is available on their website. Slurp uses yahoo.com, inktomisearch.com and alibaba.com for its spider domains, so if checking the hostname of an IP address, make sure you allow all of these and not just yahoo.com. Further information is available at http://help.yahoo.com.
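A sketch of the Yahoo-specific robots.txt extensions mentioned (the delay value and session-ID pattern are hypothetical examples):

```
User-agent: Slurp
Crawl-delay: 5            # slows down an over-excited Slurp
Disallow: /*?sessionid=   # wildcard pattern to exclude session-ID URLs
```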

Vanessa is next up and does not have a presentation, although she plans to show how Googlebot (the Google spider) works. Google uses a variety of spider names, and you can disallow one while allowing another. Although the spiders have access to a shared cache of pages, if you disallow Googlebot it will not use the cache. Webmaster Tools is then mentioned, which offers features which are not possible via robots.txt.
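For example, a robots.txt file could block one Google spider while allowing another (a hypothetical illustration of the point):

```
User-agent: Googlebot-Image
Disallow: /

User-agent: Googlebot
Disallow:
```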

The internet connection then went down which cut the presentation a little short.

These are session notes, written quickly and posted immediately after the session has been completed. Please excuse any spelling or grammar issues.
