The Bot Obedience Course - New Yahoo! Site Explorer Tool Announced

Aug 8, 2006 • 5:37 pm | comments (1) | Filed Under Search Engine Strategies 2006 San Jose
 

This should be an interesting session; Danny Sullivan is moderating. We have Jon Glick, an ex-Yahooer now at Become.com, along with Bill Atchison, Dan Thies, Rajat Mukherjee of Yahoo!, and the newly famed Vanessa Fox of Google. Brett Tabke is on my right, talking with Jon about bad bots. Tim Converse is one row behind me on my left. Danny mentioned Brett's fight with bots and had Brett wave at the crowd.

Jon Glick is up first. Robots are good at finding links and pulling content. Bots pull the content but don't do the analysis: bots are dumb, finicky, and they cannot type. Bot-friendly sites include hypertext navigation, a well-ordered hierarchical site, and clear instructions in your robots.txt. Robot traps include dynamic content, excessive parameters, and perpetual calendars. Use a robots.txt file to tell the bots what to do or not do on your site. You can also use meta tags on a page-by-page basis, or the rel="nofollow" attribute. Well-behaved bots obey the robots.txt file and meta tags, identify themselves (they don't spoof), don't crawl too aggressively, provide FAQs, etc. When bots go bad, the most evil bots don't obey robots.txt or meta tag instructions. How do you detect them? Look at your daily logs and do some real-time analysis. Dealing with misbehaving bots: don't hesitate to block them, sometimes just with a 24-hour block, and block at the firewall level. You can also try putting up a challenge, such as a text code in an image. Be careful who you block; track who gives you traffic.
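Jon's advice about clear robots.txt instructions and avoiding robot traps might look like this minimal example (the paths and bot name here are hypothetical, purely for illustration):

```text
# Keep well-behaved crawlers out of the robot traps
User-agent: *
Disallow: /calendar/      # perpetual calendar pages
Disallow: /search         # dynamic search-result pages
Disallow: /cart/          # session-heavy, parameter-laden URLs

# Tell a specific misbehaving (but obedient) bot to stay out entirely
User-agent: BadBot
Disallow: /
```

Remember his caveat: only well-behaved bots honor this file; the evil ones need log analysis and firewall blocks.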

Dan Thies from SEO Research Labs. Duplicate content is the same content presented on more than one URL; most web sites do this to themselves. There is also near-duplicate content. There is a difference between being filtered from the index and being filtered from the search results. Duping yourself: duplicate URLs, shopping sites, and near-empty pages. Getting duped: by screen scrapers, RSS feeds, and proxy URLs. The impact on traffic: 10-15% of traffic was organic search; after de-duping the site, 20-25% came from organic search. The revenue drop was "feelable." Reverse cloaking vs. scrapers: simple user-agent detection; if the user agent is not a major search engine spider, insert meta name="robots" content="noindex". Screen scrapers that steal an entire page's HTML then get a page that will not be indexed. This is easily thwarted by someone who cares to, but it substantially reduces duplication by scraping. Links by proxy is an old trick: hack someone else's site to create a link or redirect to one of your sites, either by creating a page or creating a URL via an XSS attack, then link to it using a proxy URL. There are also public proxies that you can use. Proxy URLs as duplicates: there are thousands of public proxy servers, every URL on the web can be duplicated by them, and proxy-based duplicates, when linked to, can affect duplicate content filtering. Public proxies pass along the user agent, but proxies use their own IPs. How do you stop them? Spider validation vs. proxies: when you get a request from a search engine spider user agent, check the requesting IP address. This is dangerous, so use with caution. But what if they get through? Change and rotate content (testimonials, news, headlines) and use brute force. The most important page on your site is probably the home page, yet it is the least likely to get changed often (hmmm).
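Dan's "check the requesting IP address" advice is commonly implemented as a forward-confirmed reverse DNS check: reverse-resolve the IP, verify the hostname belongs to a real engine, then forward-resolve the hostname and confirm it maps back to the same IP. A minimal sketch (not from the session; the hostname suffix list is an assumption for illustration):

```python
import socket

# Hostname suffixes the major engines crawl from
# (this list is an illustrative assumption, not exhaustive)
CRAWLER_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net", ".search.msn.com")

def hostname_is_crawler(hostname: str) -> bool:
    """Pure check: does a reverse-DNS hostname belong to a known engine?"""
    return hostname.endswith(CRAWLER_SUFFIXES)

def validate_spider(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the IP must reverse-resolve to a
    crawler hostname, and that hostname must resolve back to the same IP.
    A proxy borrowing a spider's user agent fails this check because it
    crawls from its own IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
        if not hostname_is_crawler(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]    # forward confirm
    except (socket.herror, socket.gaierror):
        return False
```

This is the "dangerous, use with caution" part: a DNS hiccup or an engine adding a new crawl hostname can make you block a real spider.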
Monitoring dupes: set up monitoring for a signature SERP text that is unique to your pages (home page duplication is the #1 issue), use a second signature for internal pages, and he then lists some tools. You can also use the DMCA, the Digital Millennium Copyright Act: send notices to the hosting provider or to the search engines. I'll leave off the "challenging the search engines" slide.

Bill Atchison from CrawlWall.com is now up. He calls these bad bots parasites. He said one day a scraper took down his server; 10% of his traffic was from these bad spiders, these parasites. Bad bots ignore robots.txt, spoof bot names, and use multiple IPs. They want to get your data to make money; motivations include AdSense, YPN, and affiliates. Who are these bots? Intelligence-gathering bots, content scrapers, data aggregators, link checkers, privacy checkers, etc. Stealth bots vs. visible bots: visible bots are easy to block; the stealth bots are the ones masquerading as humans. How do scraper bots use your content? He created the name CrawlWall so he could easily find pages unique to that keyword, and used searches for the term CrawlWall to locate sites that stole his content. They took several web sites and scrambled the content together to serve up Google AdSense. He sometimes feeds them back cookie information so he can track them better, and he logs all this activity. Scrapers also cloak and hide your content. He shows two active proxies that hijack content and crawled as Googlebot. How do you stop bots? Opt-out bot blocking fails: robots.txt only works for the well-behaved bots, as most bad bots ignore robots.txt except when trying to avoid spider traps. He went to an opt-in strategy: only Google, Yahoo!, etc. can come into my web site. That can get you into trouble, though; you need to review your traffic prior to doing this. He finds Google Analytics very useful. He created a lot of rules to tell stealth bots apart from real visitors: some bots use cookies, very few bots execute JS, bots hardly ever examine CSS files, bots rarely download images, monitor the speed and duration of site access, observe the quantity of page requests, and so on. He will then serve up an image access code to suspects. Robots.txt is itself a spider trap, because stealth crawlers reading this file expose themselves while trying to avoid spider traps.
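Bill's behavioral rules (cookies, JS execution, CSS and image fetches, request rate) can be combined into a simple bot-likeness score. This is a sketch of the idea only; the weights, threshold, and request-rate cutoff are all illustrative assumptions, not his actual rules:

```python
from dataclasses import dataclass

@dataclass
class Visitor:
    accepts_cookies: bool
    ran_javascript: bool
    fetched_css: bool
    fetched_images: bool
    pages_per_minute: float

def stealth_bot_score(v: Visitor) -> int:
    """Crude 0-5 score: higher means more bot-like. Weights are guesses."""
    score = 0
    if not v.accepts_cookies:
        score += 1
    if not v.ran_javascript:
        score += 1      # very few bots execute JS
    if not v.fetched_css:
        score += 1      # bots hardly ever examine CSS files
    if not v.fetched_images:
        score += 1      # bots rarely download images
    if v.pages_per_minute > 30:
        score += 1      # inhuman request rate
    return score

def looks_like_stealth_bot(v: Visitor, threshold: int = 3) -> bool:
    """Above the threshold, serve the image access-code challenge."""
    return stealth_bot_score(v) >= threshold
</```

A visitor scoring above the threshold would get the image access code Bill mentions, rather than an outright block, since any single signal can misfire on a real human.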
Also, anyone visiting your privacy pages is probably not a real visitor. Avoid search engine pitfalls: don't allow search engines to archive pages, as the search engine cache is also a scraping target. People also scrape through translation tools. Ways to protect your site: use a script to dynamically display robots.txt, showing the proper info to allowed bots while all others see a disallow. Use user-agent filtering and blocking with the rules structured as an opt-in allow list. Block entire IP ranges for web hosts that host or facilitate access for scraper sites. For blocking large lists of IPs, such as proxy lists, use PHP and a database like MySQL.
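The dynamically generated robots.txt with an opt-in allow list could be sketched like this (Bill mentions PHP; Python is used here for consistency, and the bot names and paths are illustrative assumptions):

```python
# Opt-in allow list: only named crawlers get real crawl rules;
# everyone else is told to stay out. Names and paths are illustrative.
ALLOWED_BOTS = {"googlebot", "slurp", "msnbot"}

def dynamic_robots_txt(user_agent: str) -> str:
    """Serve a robots.txt body based on who is asking."""
    ua = user_agent.lower()
    if any(bot in ua for bot in ALLOWED_BOTS):
        # Allowed engines see the normal rules
        return "User-agent: *\nDisallow: /cgi-bin/\n"
    # Everyone else sees a blanket disallow
    return "User-agent: *\nDisallow: /\n"
```

As Bill notes, this doubles as a spider trap: a stealth crawler that fetches robots.txt to dodge your traps identifies itself in your logs by the very request.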

Rajat Mukherjee from Yahoo! is now up. Yahoo! Search's web crawler is named Slurp. He has news about new Site Explorer features: you can add your site and then authenticate it (this looks a lot like Google Sitemaps); to authenticate, you place a file on your site. You can manage site feeds, RSS feeds, etc. In addition to those standard features, they added a subdomains filter, a different view of those results, and a way to get that data out of the system via flat file or API.

Rajat then moves on to bot obedience. Slurp is a very obedient bot, he said. Read robotstxt.org. He showed us a photograph of Slurp, a joke of course. Make sure you allow the content you want Yahoo! to get and disallow the content you don't want them to index. Yahoo! does honor a crawl-delay parameter. http://help.yahoo.com/search is very well organized, with some new resources added there. Slurp is new and better; they announced it last week. They showed the blog posts from the Yahoo! Search Blog and Loren Baker's blog from 7/28/2006, where you should see up to a 25% reduced load on your sites. He asked who has seen a reduction in load, and about 1.5 people raised their hands out of hundreds. Yahoo! does have multiple crawlers, but please send feedback to Yahoo! about these crawlers.
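The crawl-delay parameter Rajat mentions goes in robots.txt under Slurp's user-agent group; the delay value here is just an example:

```text
# Ask Yahoo!'s crawler to wait between successive fetches
User-agent: Slurp
Crawl-delay: 5
```

Crawl-delay is a non-standard extension; Slurp honors it, but not every crawler does.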

Vanessa Fox from Google is up last. She put up some funny robots.txt files she found; she had no real slides. She talks about google.com/webmasters, announced last week, which includes a tool to check your robots.txt file. She talks briefly about the www vs. non-www issue, which is now handled at google.com/webmasters, where you can define which is the proper structure. Every once in a while a host may block a Googlebot IP.

 

Comments:

Ted

08/17/2008 03:49 am

Superb information. Especially how to properly use your robots.txt file for bots. Also, Yahoo's Slurp does seem to obey a robots.txt file, but takes forever to crawl a site even with a proper site map and authenticated account.
