Robots.txt Summit

Apr 11, 2007 • 6:04 pm | comments (1) by twitter | Filed Under Search Engine Strategies 2007 New York
 

This session lets the search teams at Ask, Live, Google, and Yahoo give input on various robots.txt files and ask the audience how to improve the robots.txt standard. It is very discussion-based, and representatives from the Big 4 ask for input in a variety of areas related to robots.txt (and sitemaps).

Presented by: Keith Hogan, Ask.com; Eytan Seidman, Live.com; Dan Crow, Google; Sean Suchter, Yahoo!

Moderated by Danny Sullivan

Keith Hogan first presents the Ask.com company profile. Less than 35% of servers have a robots.txt file. The majority of robots.txt files are copied from one found online that is very generic (2.5M hosts have this file). Robots.txt files vary in length from 1 character to over 256,000 characters.

He shows a histogram of the sizes - the peak is at 23 characters. 11% of files are this size:

User-agent: *
Disallow: /
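As a rough illustration of what that generic file means to a crawler, here is a minimal sketch using Python's standard urllib.robotparser. The bot name "ExampleBot" and the example.com URL are made up for the example; they are not from the session.

```python
from urllib.robotparser import RobotFileParser

# The generic file quoted in the talk, one directive per line.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" and the URL are hypothetical values for illustration.
blocked = not parser.can_fetch("ExampleBot", "http://www.example.com/any/page.html")
print(blocked)  # True: every path is disallowed for every user agent
```

Under this parser's reading, the file shuts out all compliant robots from the whole site.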

Robots format is not well understood. He shows a screenshot that has a funny comment saying "Please use during off-peak hours."

A recent addition to robots.txt is a sitemap directive: SITEMAP: http://www..xml
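A small sketch of how a crawler might pick up that directive, again with Python's standard urllib.robotparser (site_maps() needs Python 3.8 or later). The sitemap URL here is a placeholder I made up, since the one in the talk notes is cut short.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the sitemap URL is a placeholder, not the one from the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
SITEMAP: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

This parser matches the directive name case-insensitively, so "SITEMAP:" and "Sitemap:" are treated the same way.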

Robots and sitemaps are linked together: they accomplish similar goals. Sitemaps give webmasters page-level control to identify pages, etc.

Possible changes/additions to robots: is it time to change the format to XML? It would improve accuracy control, perhaps: crawler allow groups/disallow groups, allow paths, disallow paths.

Another possibility is to have peaks during daytimes, valleys at nighttimes, etc. Should webmasters be able to stop or slow crawling of the site at different times?

Perhaps it is better to specify a start crawl time and end time.

Some sites have hosts/IPs that are dedicated to crawlers and tell them to visit certain sites.

HTML provides meta directives - noindex, nofollow, nocache, noarchive. Should these be added to robots.txt?

Another thing is spider traps or duplicate content for crawlers even though there are plenty of heuristics to identify these problems - session IDs, affiliate IDs. Should robots add hints for this so that sites don't end up with duplicate pages and smaller link credits?

You can find the Ask crawler information from the About page.

Question for audience: If you had your own machine and website, what is your interaction with your hosting company and how can you control crawling your site?

Eytan Seidman from Live.com presents next. He asks how many people use robots.txt. He shows us the hilton.com robots.txt file that says "Do not crawl the site during the day!"

A big part of the relationship between websites and search engines is communication. Search engines have no good way of communicating with websites through robots.txt. There should be a protocol to facilitate this communication in robots.txt.

Robots.txt's protocol is very complex. Engines don't support a common set of controls. There is some commonality but it's not as good as it could be.

Dan Crow from Google speaks next.

He speaks about the robots.txt exclusion protocol - robots.txt and robots meta tags - which tells search engines what not to index. The exclusion protocol started in 1994 and is the de facto standard in the industry. There are still significant differences between search engines.

Standardization: Should we revive the standardization effort? Common core features as they exist/defined extension mechanism.

Long-term goal - consistent syntax and semantics/improved common feature set.

Sean Suchter, director of search technology at Yahoo, speaks next. He says that the Yahoo spider is Yahoo Slurp, which supports all standard robots.txt commands. There are custom extensions, such as crawl-delay, sitemap, and wildcards. There are custom meta extensions, such as NOODP and NOYDIR. He adds that different Yahoo search properties use different user agents, so if you are trying to affect one robot, please only address that robot. You want to be careful - depending on how you use your robots.txt, you will have different effects on different robots and can lose out on traffic from some types of search.

One question that he has for us is regarding the crawl-delay. How should this be rate limited? A crawl delay actually seen "in the wild" of 40 seconds means that Yahoo can never crawl a large news site. Is it about bandwidth reduction? In what manner is this used?
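To put the 40-second figure in perspective, here is a back-of-the-envelope sketch. It assumes crawl-delay means "wait this many seconds between consecutive requests from one crawler", which is only one possible reading; the panel itself says there is no shared definition. The helper name max_fetches_per_day is mine.

```python
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60

def max_fetches_per_day(crawl_delay_seconds: float) -> int:
    """Upper bound on sequential fetches per day, ignoring download time.

    Assumes one request at a time with a fixed wait between requests.
    """
    return int(SECONDS_PER_DAY // crawl_delay_seconds)

# Hypothetical robots.txt using the delay quoted in the session.
ROBOTS_TXT = "User-agent: Slurp\nCrawl-delay: 40\n"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("Slurp")
print(delay, max_fetches_per_day(delay))  # 40 2160
```

At roughly 2,160 pages a day under that reading, a large news site would publish and update content faster than the crawler could fetch it.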

Another one that is floating around - robots-noindex and robots-index - goes in your HTML page and marks what you don't want the robots to use for purposes of retrieval. For example, templates or ad text that would cause irrelevant traffic.

The last question he has is about complex HTML and CSS, iframes, etc. There's a lot more than the page that is useful - how would users want to emphasize or exclude this?

Questions - Danny asks: Is it better to have robots.txt in an XML format? Some guy from the audience says that it should be part of the sitemaps standard if this is a requirement. Another guy says that he likes the XML idea for the nature of his business. He mentions that XML could underline parts of his site that have duplicate content issues. He says that he wants to know if there are any tools that display any pages that have duplicate content.

The first guy who responds adds that he is concerned about people being able to authenticate through robots.txt to only allow spiders that people trust. Danny says that there is the ability to do reverse DNS lookups, but he acknowledges that this is a pain. How can the robots.txt be improved to allow for authenticated spiders?
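The reverse DNS check Danny mentions can be sketched roughly as follows: look up the hostname for the visiting IP, check that it belongs to a domain you trust, then resolve that hostname forward and confirm it maps back to the same IP. This is a minimal sketch, not any engine's documented procedure; the function name, the ".crawler.example" suffix, and the injectable resolver arguments are assumptions for illustration.

```python
import socket
from typing import Callable, Iterable, List

def _reverse(ip: str) -> str:
    # PTR lookup: IP address -> hostname
    return socket.gethostbyaddr(ip)[0]

def _forward(host: str) -> List[str]:
    # A lookup: hostname -> list of IPv4 addresses
    return socket.gethostbyname_ex(host)[2]

def is_trusted_crawler(
    ip: str,
    trusted_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a visiting IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or lookup failure
        return False
    if not any(host.endswith(suffix) for suffix in trusted_suffixes):
        return False
    try:
        # The hostname must resolve back to the same IP, or it could be spoofed.
        return ip in forward(host)
    except OSError:
        return False

# Offline demo with fake resolvers; the IP and hostname are made up.
print(is_trusted_crawler(
    "192.0.2.10",
    [".crawler.example"],
    reverse=lambda ip: "bot-1.crawler.example",
    forward=lambda host: ["192.0.2.10"],
))  # True
```

Two lookups per unknown visitor is part of why this is "a pain"; caching the results per IP would be the obvious mitigation.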

Danny asks about timezone control. Not many people seem overly concerned about timezone control.

A woman says that she wants the META exclusion to be available within the HTML "absolutely." She says that she actually has clients who cannot add robots.txt files to their root directory for whatever reason.

A man asks if robots.txt is optional. Dan responds to say "yes, there are." Some search engines may ignore it, though. Dan says that content will be crawled unless you tell the spiders not to crawl the content.

A developer says that he wants to be able to ignore dynamic content completely.

Danny asks about the crawl-delay and asks if all of the Big 4 have an exact definition across the board. They all look at each other quizzically and don't know. Sean says that it's probably not used the same across the board. Some webmasters don't use it correctly; it hurts their site. Crawl-delay should be defined as page loads per second or queries per second, megabytes per day, megabytes per month - but there is still no definite answer. When suggesting megabytes per day or MB per month, the audience gets all muttery and Danny responds, "We don't like that!"

Keith says that there is anecdotal evidence that some sites serve different robots.txt files at different times of the day. Sean adds that "if you do this, we have to crawl it every minute, and how many people would like that?" Even though some people actually need a time-of-day restriction, swapping robots.txt files is a very bad way to get it. "We'll note that, but please don't swap the robots.txt files."

An audience member says that the day of week is more important for him.

Danny asks how many people were hit by a scraper site, and a few people raise their hands.

An audience member says that spiders should adjust their crawling based on server response times, and Dan says that Google does this.
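One simple way a crawler could do this, sketched below, is to wait a multiple of the last response time, clamped to a sensible range. This is purely illustrative: the function name, the multiplier, and the bounds are my assumptions, and nothing in the session describes how Google actually does it.

```python
def next_delay(
    response_seconds: float,
    min_delay: float = 1.0,
    max_delay: float = 60.0,
    load_factor: float = 10.0,
) -> float:
    """Seconds to wait before the next request to the same host.

    The wait is load_factor times the last response time, clamped to
    [min_delay, max_delay]. All defaults are illustrative guesses.
    """
    return max(min_delay, min(max_delay, response_seconds * load_factor))

print(next_delay(0.05))  # 1.0  (fast server: fall back to the minimum wait)
print(next_delay(2.0))   # 20.0 (slow server: back off proportionally)
print(next_delay(30.0))  # 60.0 (very slow server: capped at the maximum)
```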

Another audience member chimes in about the XML standard and says that doing an XML based format will be more easily messed up. Dan says that this is one of the main reasons why they are against doing this XML format.

Dan says that some of the robots.txt files he sees (75,000) contain a jpg.

An audience member asks about legal jargon on his financial site with over 40,000 pages. He wants people to ignore parts of a site with the legal jargon.

Dan asks if standardization would be useful and people raise their hands and say that it would be.

An audience member says that she likes the crawl rate but she wants a robots.txt option for a crawl rate that can be slow, medium, or fast.

Another audience member has privacy concerns about robots.txt. She says that because "robots are not indexing the page, but you're still indexing the URL, we have PPC links that are easily accessible in Google and that's a click-frauders dream. Do you have comments on a separate standard where we won't even index these URLs?" Also, she asks if there's a way to have a Webmaster Central way of defining your robots.txt so that you don't have to have a public file.

Eytan says that Microsoft is looking at that. It's not a great user experience to show a URL and description so they do want to know how to optimize it.

Sean says that on Yahoo, you can delete URLs or paths in SiteExplorer and that will affect any indexing. Dan says that Google has the equivalent as well. Vanessa Fox from Google (who is in the audience taking notes) says that you can do this on the page itself so that it cannot be indexed at all.

Danny asks about who should be the priority - is the siteowner always right (unless it's the homepage)? There is mixed reaction in the crowd.

Someone in the audience says that there should be more meta commands - don't crawl me but list my URL, for example. A few people actually want a lot more options, but Danny says that there are problems that can result from these options.

Dan says that people have complained to Google that their pages weren't being crawled but it was really the fault of their robots.txt files that excluded the spiders. Sean finds this hilarious: he says "check the robots.txt file!"

Someone in the audience talks about the sitemaps.org standard and asks what would happen if you left some allowed pages out of the sitemap but didn't exclude them in robots.txt. The speakers agree that these pages will not be excluded. Keith reiterates that if a page is excluded in robots.txt, it will be blocked, but if it's simply not in the sitemap, it will still be crawled.

Feature request: the ability to tell search engines about dynamic session parameters so that they do not crawl them. Sean says that you should redirect Yahoo's robot to another page without the session ID. It is not the best solution, but if you see Yahoo Slurp coming, you could redirect to the right page without tracking codes or session IDs.
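A site-side version of Sean's workaround might look something like the sketch below: if the visitor looks like a crawler and the URL carries session or tracking parameters, compute a clean URL to redirect to. The parameter names, the "slurp" user-agent token, and both function names are assumptions for illustration; a real site would use its own parameter list and its own way of recognizing crawlers.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names; adjust to whatever your site actually uses.
SESSION_PARAMS = {"sessionid", "sid", "affid"}
CRAWLER_TOKENS = ("slurp",)

def strip_tracking(url: str) -> str:
    """Return the URL without session/tracking query parameters."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k.lower() not in SESSION_PARAMS]
    if len(kept) == len(pairs):
        return url  # nothing to remove; leave the URL unchanged
    return urlunsplit(parts._replace(query=urlencode(kept)))

def redirect_target(user_agent: str, url: str) -> Optional[str]:
    """Clean URL to redirect a crawler to, or None if no redirect is needed."""
    if not any(token in user_agent.lower() for token in CRAWLER_TOKENS):
        return None  # regular visitors keep their session parameters
    clean = strip_tracking(url)
    return clean if clean != url else None

print(redirect_target(
    "Mozilla/5.0 (compatible; Yahoo! Slurp)",
    "http://www.example.com/item?id=7&sid=abc123",
))  # http://www.example.com/item?id=7
```

Matching on the user-agent string alone is easy to spoof, so this only decides where to send a visitor that claims to be a crawler; it should not gate anything sensitive.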

Sean reiterates part of his presentation and asks how robots should be used to focus on CSS, iframes, etc. Nobody has said anything thus far so he wanted some input. Dan says that Google looks at CSS, but a lot of people actually do block CSS files through robots.txt. Dan says that the CSS should not be blocked because it is occasionally used by Google to find out if people are using spammy content (white text on a white background, for example).

Sean asks why people in the audience are blocking off CSS files. Nobody really has an answer but one guy says that he puts all the non-search-engine-essential content in a folder (CSS, Javascript, etc.) and robots.txt tells the engines not to crawl that folder.

Note: Please excuse typos, this coverage is provided live from a complete discussion with much audience interaction.

 

Comments:

Sebastian

04/12/2007 08:08 pm

Tamar, just in case you didn't get the twitter msg: Thanks for your great coverage. Steering crawlers is a pet of mine and I wasn't able to make it to NY.
