This should be an interesting session; Danny Sullivan is moderating. We have Jon Glick, ex-Yahooer now at Become.com, along with Bill Atchison, Dan Thies, Rajat Mukherjee of Yahoo! and the newly famed Vanessa Fox of Google. Brett Tabke is on my right, talking with Jon now about bad bots. Tim Converse is one row behind me on my left. Danny mentioned Brett's fight with bots and had Brett wave at the crowd.
Jon Glick is up first. Robots are good at finding links and pulling that content. Bots pull the content but don't do the analysis; bots are dumb, finicky, and they cannot type. Bot-friendly sites include hypertext navigation, a well-ordered hierarchical structure and clear instructions in your robots.txt. Robot traps include: dynamic content, excessive parameters and perpetual calendars. Use a robots.txt file to tell the bots what to do or not do on your site. You can also use meta tags on a page-by-page basis, or the rel="nofollow" attribute. Well-behaved bots obey the robots.txt file and meta tags, they identify themselves (they don't spoof), they don't crawl too aggressively, they provide FAQs, etc. When bots go bad, the most evil bots ignore the robots.txt and meta tag instructions. How do you detect these? Look at your daily logs and do some real-time analysis. Dealing with misbehaving bots: don't hesitate to block them, sometimes just for 24 hours, and block at the firewall level. You can also try putting up a challenge, such as a text code rendered in an image. Be careful who you block; track who gives you traffic.
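The "clear instructions in your robots.txt" advice above might look like this minimal sketch (the paths are hypothetical stand-ins for the dynamic content, excessive-parameter URLs and perpetual calendars he called robot traps; the Crawl-delay line uses the parameter Yahoo! says it honors):

```
# Hypothetical robots.txt sketch: let bots at real content,
# keep them out of robot traps.
User-agent: *
Disallow: /search?        # dynamic content with excessive parameters
Disallow: /calendar/      # perpetual calendar pages

# Yahoo!'s Slurp honors a crawl-delay (in seconds)
User-agent: Slurp
Crawl-delay: 5
```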
Dan Thies from SEO Research Labs. Duplicate content is the same content presented on more than one URL. Most web sites do this to themselves. There is also near-duplicate content. There is a difference between being filtered from the index and being filtered from the search results. Duping yourself: duplicate URLs, shopping sites and near-empty pages. Getting duped: by screen scrapers, RSS feeds and proxy URLs. The impact on traffic: 10-15% of traffic was organic search; after de-duping the site, 20-25% came from organic search. The revenue drop was "feelable." Reverse cloaking vs. scrapers: simple user agent detection; if the user agent is not a major SE spider, insert meta name="robots" content="noindex". Screen scrapers that steal an entry page's HTML get a page that will not be indexed. This is easily thwarted by someone who cares to, but it reduces duplication from scraping substantially.

Links by proxy is an old trick: hack someone else's site to create a link or redirect to one of your sites (either create a page or create a URL using an XSS attack), then link to it using a proxy URL. There are also public proxies that you can use. Proxy URLs as duplicates: there are thousands of public proxy servers, every URL on the web can be duplicated by them, and proxy-based duplicates, when linked to, can affect duplicate content filtering. Public proxies pass along the user agent, but proxies use their own IPs. How do you stop them? Spider validation vs. proxies: when you get a request from a search engine spider user agent, check the requesting IP address. This is dangerous, so use with caution. But what if they get through? Change and rotate content (testimonials, news and headlines) and use brute force. The most important page on your site is probably the home page, yet it is the least likely to get changed often (hmmm).
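The two defenses described, reverse cloaking and spider validation, can be sketched in Python. The spider names and the reverse-DNS hostname suffixes below are assumptions for illustration; a real deployment would use each engine's documented crawler hostnames, and as the panel warns, the IP check should be used with caution:

```python
import socket

# Assumed mapping of spider user-agent tokens to the hostname suffixes
# their crawler IPs resolve to (illustrative, not authoritative).
SPIDER_HOSTS = {
    "Googlebot": ".googlebot.com",
    "Slurp": ".crawl.yahoo.net",
}

def noindex_meta(user_agent):
    """Reverse cloaking vs. scrapers: if the user agent is not a major
    SE spider, return a noindex meta tag to insert into the page, so a
    scraped copy of the HTML will not be indexed."""
    if any(name in user_agent for name in SPIDER_HOSTS):
        return ""
    return '<meta name="robots" content="noindex">'

def is_real_spider(user_agent, ip, reverse_dns=None):
    """Spider validation vs. proxies: a request claiming a spider user
    agent must come from an IP whose reverse DNS matches the engine's
    crawler domain (public proxies pass the user agent along but make
    the request from their own IPs)."""
    if reverse_dns is None:
        reverse_dns = lambda addr: socket.gethostbyaddr(addr)[0]
    for name, suffix in SPIDER_HOSTS.items():
        if name in user_agent:
            try:
                return reverse_dns(ip).endswith(suffix)
            except OSError:
                return False
    return False  # not claiming to be a spider at all
```

A stricter variant would also resolve the returned hostname forward again and confirm it maps back to the requesting IP, since reverse DNS alone can be spoofed by whoever controls the IP block.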
Monitoring dupes: set up monitoring for a signature SERP text that is unique to your pages; home page duplication is the #1 issue; use a second signature for internal pages. He then lists some tools. You can also use the DMCA, the Digital Millennium Copyright Act: send notices to the hosting provider or to the search engines. I'll leave out the slide on challenging the search engines.
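The monitoring step above might be sketched like this: query a search engine for your signature phrase with whatever tool you use, then flag any result hosted outside your own domain as a possible duplicate. The result format and the domain here are assumptions for illustration:

```python
from urllib.parse import urlparse

def find_dupes(results, own_domain):
    """Given (url, snippet) pairs returned by a SERP query for a
    signature phrase unique to your pages, return the URLs that are
    not on your own domain -- candidate duplicates/scrapes."""
    dupes = []
    for url, snippet in results:
        host = urlparse(url).netloc.lower()
        # Keep the bare domain and its subdomains; flag everything else.
        if not (host == own_domain or host.endswith("." + own_domain)):
            dupes.append(url)
    return dupes
```

Run this periodically with one signature for the home page and a second for internal pages, as suggested in the session.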
Rajat Mukherjee from Yahoo! is now up. Yahoo! Search's web crawler is named Slurp. He has news about the new Site Explorer features. New features: you can add your site and then authenticate it (this looks a lot like Google Sitemaps); to authenticate, you place a file on your site. You can manage site feeds, RSS feeds, etc. In addition to those standard features, they added a subdomains filter, a different view of those results and a way to get that data out of the system via flat file or API.
Rajat Mukherjee then moves on to bot obedience. Slurp is a very obedient bot, he said. Read robotstxt.org. He showed us a photograph of Slurp, a joke of course. Make sure you allow the content you want Yahoo! to get and disallow the content you don't want them to index. Yahoo! does honor a crawl-delay parameter. The help at http://help.yahoo.com/search is very well organized, with some new resources added there. Slurp is new and better; they announced it last week. They show the blog posts from the Yahoo! Search Blog and Loren Baker's blog from 7/28/2006, after which you should see up to 25% reduced load on your sites. He asks who has seen a reduction in load, and about 1.5 people raised their hands out of hundreds. Yahoo! does have multiple crawlers, so please send feedback to Yahoo! about these crawlers.
Vanessa Fox from Google is last up. She put up some funny robots.txt files she found; she had no real slides. She talks about google.com/webmasters, announced last week, which includes a tool to check your robots.txt file. She talks briefly about the www vs. non-www issue, which is now handled at google.com/webmasters, where you can define which is the proper structure, www or non-www. Every once in a while a host may block a Googlebot IP.
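A common server-side complement to that www vs. non-www setting is a 301 redirect that collapses the two hostnames to one, so only a single version of each URL gets crawled and indexed. A minimal Apache .htaccess sketch, with example.com as a stand-in for your own domain:

```
# Hypothetical sketch: permanently redirect non-www to www
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```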