Brett Tabke Interviewed on Bot Banning

Nov 28, 2005 • 8:08 am | comments (10) | Filed Under SEO & SEM Forum News
 

Brett Tabke, owner of WebmasterWorld, has given me the privilege of asking him a bunch of questions about the recent news that WebmasterWorld Bans Search Engine Bots from Crawling. So here it is....

Barry: Brett...Thank you for taking the time during this hectic period at WebmasterWorld to answer several questions about the recent changes you have made to disallow spiders from accessing your site.

Barry: The big change was that you, on November 18th, changed your robots.txt file to disallow all bots from accessing your Web site. In a thread you started in the Foo forum at WebmasterWorld named lets try this for a month or three... you elegantly linked to your robots.txt file to show people, and subtitled the thread, "last recourse against rogue bots." Why was this the last course of action? I have spoken with dozens of site owners who run sites as large as yours. Most tell me that you can fight off these rogue bots one by one, but that you need to factor the costs of these bots into your hosting prices. How would you respond to that?

Brett: It is difficult to talk about issues that brush shoulders with security-related matters. Once you discuss a problem and your actions to counter it in public, you invite an inevitable countermeasure. That said, we have been saying for many years that this was our number one problem on the site. I made a plea in the forums five years ago for a robots inclusion standard (instead of an exclusion standard).

One thing that sets WebmasterWorld apart from all other similar sites is the ease with which we can be crawled. There are no CGI parameters on URL strings, and all off-the-shelf bots can index the site. I can write a 15-line Perl program in 5 minutes that will download the entire site - even with cookie support. The same thing cannot be said about sites that are not freely crawlable (like other forums and auction sites with CGI-based or non-standard URLs).
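Brett's point is easy to verify. A sketch of the kind of trivially short downloader he describes - written here in Python rather than Perl, with an invented page limit, and in no way the actual code anyone ran against the site:

```python
# Minimal same-host "site ripper" sketch: breadth-first fetch of linked
# pages, carrying cookies between requests (the detail Brett highlights).
import re
import urllib.request
import urllib.parse
import http.cookiejar

def extract_links(html, base_url):
    """Pull href targets out of raw HTML and resolve them against base_url."""
    hrefs = re.findall(r'href\s*=\s*["\']([^"\']+)["\']', html, re.I)
    return [urllib.parse.urljoin(base_url, h) for h in hrefs]

def rip_site(start_url, max_pages=50):
    """Download up to max_pages pages on the same host as start_url."""
    jar = http.cookiejar.CookieJar()  # cookie support in two lines
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    host = urllib.parse.urlparse(start_url).netloc
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = opener.open(url).read().decode("utf-8", "replace")
        pages[url] = html
        for link in extract_links(html, url):
            if urllib.parse.urlparse(link).netloc == host:
                queue.append(link)
    return pages
```

Against a site with clean, parameter-free URLs like the one described, nothing more than this is needed.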

The change was to require cookie support via member login. That meant the approved big search engine crawlers would either feast on a login page, or view several million pages before realizing the site was 100% different than before. The easiest solution was to set a robots.txt ban on all crawlers.
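The blanket ban described here amounts to the simplest possible robots.txt under the original exclusion standard:

```
User-agent: *
Disallow: /
```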

I knew it would be a controversial action. In such cases, it is always better to bring up the subject yourself, lest people get the wrong impression that it happened by no action of your own. I just threw up the post as a marker so that people knew we'd taken the action ourselves, and I would come back later with more information after things settled down a bit. We had started down this road about mid-July when we began blocking many of the major crawlers.

> Why was this the last course of action?

We've tried everything to stop the bots. Once we got up to several thousand IPs in the system ban list, it was having a serious effect on system performance. We also occasionally ran into a situation where we would ban an IP and then that IP would get recycled to another member who had nothing to do with a download attack. It is hard to block an IP such as an AOL IP, because you block the several million users behind the AOL proxy cache.
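For scale, the kind of Apache-era ban list Brett describes would have looked something like the fragment below (the addresses here are documentation ranges, not real entries). Apache evaluates these directives on every request, which is why a list of several thousand entries starts to hurt performance:

```
Order Allow,Deny
Allow from all
Deny from 192.0.2.17
Deny from 198.51.100.0/24
# ...several thousand more Deny lines, checked per request...
```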

> I have spoken with dozens of site owners who run sites as large as yours.

Size is not the only issue. The ease with which WebmasterWorld can be crawled comes first. I've been studying offline browsers for about a week. All of the site rippers and offline browsers available from Tucows are able to download WebmasterWorld in its entirety. Only 6 were able to successfully download part of a vBulletin site. One would also choke on weird URLs (like caps in filenames, or extremely long filenames).

> Most tell me that you can fight off these rogue bots one by one,

Ya, we were spending about an hour or two a day on this problem. I was at the point of hiring one person full time to address it.

Barry: As part of this process, you made a change that now requires cookie support, something most bots cannot handle. As a side effect, all members had to log in to WebmasterWorld again. First question: do you have any stats on how many times the "forgot my password" function was used over the past 5 days? :) And my second question is: wouldn't it have been more effective to spend money on a full-time server guy to fight off these bots than to lose the search engine traffic completely?

Brett: The majority of people are using browsers such as Opera or IE that auto-remember passwords. We also switch our cookies about once every 60 days for this very reason. That keeps people from leaving cookies lying around in an internet cafe, or on their work machine.

> effective to spend money on a full time server guy to fight off these bots than to lose the search engine traffic completely?

Even hiring a full-time guy at this point wouldn't fix the problem. All the tools we have used are only a band-aid on cancer. We have tried: page view throttling, bandwidth throttling, agent name parsing, cookie requirements for selected ISPs (over 500, including all of Europe/China), IP banning, link poisoning, various auto-banning, and various forms of cloaking and site obfuscation to make the site uncrawlable to non-SE bots.
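As an illustration of the first item on that list, page-view throttling can be sketched as a per-IP sliding-window counter. This is a toy version, not WebmasterWorld's code, and the thresholds are invented for the example:

```python
# Toy page-view throttle: allow at most max_hits requests per IP
# within a sliding window of `window` seconds.
import time
from collections import defaultdict

class Throttle:
    def __init__(self, max_hits=60, window=60.0):
        self.max_hits = max_hits       # allowed page views per window
        self.window = window           # window length in seconds
        self.hits = defaultdict(list)  # ip -> timestamps of recent hits

    def allow(self, ip, now=None):
        """Record a hit from ip; return False once it exceeds the limit."""
        now = time.monotonic() if now is None else now
        recent = [t for t in self.hits[ip] if now - t < self.window]
        recent.append(now)
        self.hits[ip] = recent
        return len(recent) <= self.max_hits
```

The hard part, as the rest of the list shows, is not any one mechanism but combining them without banning AOL's proxy cache or an entire country by accident.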

The biggest issue is the massive amount of system overhead and time it takes to manage all that. The totality of it is staggering. From raw log-file parsing, to code, to server setup, managing it all takes an inordinate amount of time. It is very easy to make mistakes in all that (like the time we banned New Zealand visitors because we banned the big ISP's proxy server there). Our site is here for the members - not the rogue bots.

Barry: On that note: almost all the big names in the industry were shocked that you would take this action. They pretty much laughed at the idea that you wouldn't be delisted within 30 days, let alone 60 days. Danny Sullivan said:

Brett figures he's got 60 days until pages drop from places like Google to get an alternative search solution in place. That seems optimistic to me. WebmasterWorld is a prominent site and should be getting revisited on a sub-daily basis. If search engines are hitting that robots.txt ban repeatedly, they ought to be dropping those pages in short order, or they aren't very good search engines. I mean, can you imagine the irony of Google and Yahoo getting pilloried on WebmasterWorld for taking so long to drop pages after they were told to do so when the ban was put into place.

Search experts like DaveN, Oilman, SEGuru and others all felt the same way. Why did you really feel it would not happen so fast?

Brett: It has been over 180 days since we blocked GigaBlast, 120 days since we blocked Jeeves, over 90 days since we blocked MSN, and almost 60 days since we blocked Slurp. As of last Tuesday we were still listed in all but Teoma. MSN was fairly quick, but still listed the URLs without a snippet.

Google will hang onto a site for up to 90 days after you put up a robots.txt ban. Even if the site is completely unreachable, we have seen sites still listed as URL-only entries up to six months later. Only via the Google URL removal utility is that process faster. It is a feature I had not used on Google in many years, and I completely overlooked it.

Barry: Also in that summary thread, listed above, you expressed your frustration with the engines for "changing a perfectly good and accepted internet standard." Can you expand on that, and what steps you think they should take to get the robots.txt syntax the way it should be for 2005?

Brett: Without webmaster input, changing the robots.txt standard only encourages others to play with the standard too. Of the offline browser bots I looked at from Tucows, the majority can be set to ignore robots.txt. Why? Because the standard has not been appreciated, endorsed, or adhered to by the engines, or by the offline browser and site ripper programmers. The engines have fostered an era of robots.txt disrespect.

The engines changing the standard to suit their own needs is exactly the same as Netscape and Microsoft playing around with the HTML standards during the browser wars. Only by adhering to and endorsing standards can we together keep the net from becoming more chaotic than it is now. What a webmaster already has to know is too much for one person. The last thing the internet needs is every big search engine coming out with its own version of the robots.txt standard. We need them to support the standard, or form an open commission of their peers and ours to come up with a new one (which I have been endorsing for 5 years).

That said, as the author of the first robots.txt validator in 1998, I do take the standard very seriously. Hardly a day goes by when I don't get an email from someone asking why their robots.txt with an "Allow" line was marked as bad by the robots.txt validator.

Barry: Because you are an SEO expert, people came up with wild theories as to why you really did this. Some people said you were banned for cloaking. Some said you had a crazy PR stunt in mind - one theory was that the search engines were coming out with a uniform site submission tool and you wanted to be the first to use it. Others said you wanted to show the search engines that you do not need them. I am sure you have heard many other theories. Which do you find the funniest? Which do you find the most outrageous? And how would you respond to some of them?

Brett: I often forget the scale of how huge WebmasterWorld has become and how many people look to us for leadership on issues like this. I have given up trying to disabuse people of notions to the contrary about why we do things. Not every hat is tin foil and not every helicopter black.

> Some people said you were banned for cloaking.

In order to address many of the rogue site ripper issues, we do openly cloak some things at the agent level. We have to be able to determine what is a good SE bot and what isn't. If we randomly threw around poison links that lead to auto-bans without knowing which bot was which, we would be banning the SE bots left and right. We also use it to keep randomly ad-served content off the page, where the only difference is the filename of the image file; that would otherwise encourage massive amounts of respidering.

We do everything we can to try to outfox the rogue bots. SE bots were always served the same content as members, and we never IP cloak, so it is clear to just about everyone what we are doing. You could always check with a simple agent name switch to Slurp. Sometimes we will trip and make a mistake ourselves, as there are a few thousand lines of code dedicated to the issues we are talking about.
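A toy version of the agent-level gating described above might look like the following. The agent list and trap URL are invented for illustration, and as noted in the surrounding discussion, agent strings are trivially faked, which is why this is only one signal among many:

```python
# Agent-level gating sketch: requests that don't claim to be a known SE bot
# get a poison "trap" link mixed into the page; anything that follows it
# would be auto-banned. Agent names and URLs here are illustrative only.
KNOWN_SE_AGENTS = ("googlebot", "slurp", "msnbot", "teoma")  # example list

def looks_like_se_bot(user_agent):
    """Crude check: does the agent string mention a known SE crawler?"""
    ua = (user_agent or "").lower()
    return any(name in ua for name in KNOWN_SE_AGENTS)

def choose_links(user_agent):
    """Apparent SE bots get the normal links; everyone else gets a trap added."""
    links = ["/forum/", "/library/"]
    if not looks_like_se_bot(user_agent):
        links.append("/trap/do-not-follow/")  # hypothetical auto-ban URL
    return links
```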

The number of things needed to address rogue bots is absurd. It was while I was trying to trim the htaccess ban list down to a few thousand IPs, after getting hit for 12 million page views in a week, that I threw my hands in the air, turned on required login, and blocked all the bots. It wasn't a spur-of-the-moment decision, but it was a spur-of-the-moment reaction. If I had it to do over again, the only thing I would do differently is have the new site search engine debugged and ready to go.

> Some people said that you had a crazy PR stunt in mind.

I knew there would be interest in it among WebmasterWorld members. Some of the speculation by other noted webmasters was flat-out wrong, came from self-interested competitors, and showed a complete lack of understanding of the tech issues involved. One major blogger suggested that we could address all this with a couple of bans in the htaccess list. I laughed when I listened to it, because we had close to 4000 IPs in there and were on the verge of banning entire C blocks and all of the AOL proxy servers. Clearly, the tech issues were well beyond his knowledge.

> Others said that you wanted to show the search engines that you do not need them.

Yes, a hundred thousand targeted referrals a day are just plain wrong. Let's cut to the chase: I adore search engine traffic, but my first duty as a webmaster is to the visitors and members of our site. Anything that interferes with that to the degree these rogue spiders, downloaders, offline browsers, monitoring services, site rippers - or whatever you call them - do, I have to take action against.

> I am sure you heard of many other theories. Which do you find the most outrageous?

That I was starting a bot busting service that I had talked about 4 years ago in the forums.

> And how would you respond to some of them?

I would not respond to them. My first and only duty is to the members and visitors of WebmasterWorld. Doing anything I can to enhance their experience at the site is the goal. That viewpoint is what built WebmasterWorld and what will sustain it. Take care of your members first, and everything else will take care of itself. The more transparent we can make the tech, the better it works for everyone.

Barry: Do you expect support from Google, Yahoo, MSN and Ask Jeeves to get back into their indexes quickly? Have they offered you any support or advice?

Brett: > Do you expect to get back into their indexes quickly?

No different in that regard than any other website.

> Have they offered you any support or advice?

Yes, they have been great off the record. It isn't something they can talk about in public either. I am saddened by that fact, but I do understand that the big sites can ill afford to talk in public about security or tech issues that can have a negative effect on their own systems. I was asked in Vegas why we had banned so many engines - clearly, they had taken notice - but no one had an answer except to ask why G was still allowed on the site.

Barry: What are your plans for the next 7 days? And then the next month or two in terms of these rogue spiders and non rogue spiders?

Brett: There has to be 5 pounds of turkey in the fridge and I think the last half of the pumpkin pie will be done by the end of the day ;-)

Other than that, I have a site search engine to finish debugging, and then we have an open house at our new offices, Christmas travel, PubCon Australia, new employee training, and a spring PubCon in Boston to plan and flesh out. Interesting times indeed!

> And then the next month or two in terms of these rogue spiders and non rogue spiders?

We have made a lot of changes to the core bot detection architecture this last week. The members have been so helpful and giving with new ideas and new ways we can address the problem. There is no one magic bullet that is going to fix the problem, but a more polished approach using all the techniques is what we are working on. People have gone so far as to write custom code for us to use free of charge.

The one thing I would like to leave people with is this: download a few of the site ripper programs and run them against your own site. Test how easily your site can or cannot be crawled. Some site owners will be shocked to see that their site is either completely crawlable without regard for robots.txt, or uncrawlable because of various site architecture problems. There is something to be learned there by every site owner.
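A complementary self-test: before ripping your own site, check what a robots.txt-respecting bot would be allowed to fetch, then compare with what the ripper actually pulls. A minimal helper using Python's standard-library robots.txt parser (the paths and rules below are examples):

```python
# Given the text of a robots.txt file, return which of the candidate
# paths a rule-respecting crawler with the given agent name may fetch.
import urllib.robotparser

def allowed_paths(robots_txt, agent, paths):
    """Filter paths down to those permitted by robots_txt for agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [p for p in paths if rp.can_fetch(agent, p)]
```

The gap between this list and what a ripper retrieves anyway is a direct measure of how much of your defense rests on voluntary compliance.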

Barry: Well thank you for spending the time answering my questions. I wish you all the best and I hope everything works out in the long run.

Brett: Thank you.


Comments:

WilliamC

11/28/2005 02:44 pm

Site ripper spiders don't give a damn about robots.txt. 'Nuff said.

Chris Beasley

11/28/2005 03:18 pm

Ya, I suppose Brett then banned the site rippers with the cookie requirement and simply used robots.txt to inform the legitimate bots. Still, larger sites have to deal with the same issue and do not need to block all robots. I don't know if they already do this, but I would implement a caching system so that each page view does not result in backend scripts firing. Then tweak your web server, maybe switch from Apache to mini-apache or lighttpd. Finally, some good old-fashioned throttling or user-agent detection. Sure, user-agents can be faked, but in my experience most site ripper users are too lazy to do it. In any case it's a simple fix that might block perhaps half of the rogue bots. Throttling should make sure the rest don't overload your server (assuming you're using a reasonable hardware setup). The point is that just serving vanilla HTML pages is not resource intensive. You might need a load-balanced cluster of servers, but so what - surely they can afford it.

WilliamC

11/28/2005 03:27 pm

Adding cookie support to a site ripper is as easy as use LWP; in Perl, or using curl in PHP, so the cookie requirement is also worthless. The only way we have found that works well is to watch for bot patterns in: UA, referer, proxies, cookies, patterns of movement, [x] hits per minute, IP range, etc. Any spider not in an allowed IP range (ie: Google's) can be blocked in minutes, thus not causing much if any BW loss. The measures put into effect by Brett, in my opinion, won't do much, if anything, to solve that problem, and have cost him a ton of SE listings. Seems like he hacked off his nose to remove a zit on it.

Brett Tabke

11/28/2005 03:40 pm

First - thanks Rusty.

> Seems like he hacked off his nose to remove a zit on it.

Yes, I was completely fed up with it all. Between the number and depth of things we have to do to address the issue and appeasing the engines, it was getting into the realm of the absurd. There are 10 major code decision points and several hundred data items per decision point. That is a lot to manage and keep mistake-free. We were making a lot of mistakes - from auto-banning all of AOL to banning New Zealand - and it is pretty easy to screw up when things are automated. The other part was that, having just been through all this code when moving to the new server, it was so nice to remove it all for a while. It is amazing to see what the site can do on its own two feet.

> The measures put into effect

Making the site uncrawlable is the final step.

Seg

11/28/2005 05:29 pm

To Brett: wasn't it possible to cloak your robots.txt file in order to allow only "good bots" as a white list?

GreenWood

11/29/2005 02:27 pm

good thought about robots.txt

Search Engines Web

11/29/2005 10:25 pm

http://searchguild.com/tpage24532-0.html It is interesting to see the theories that came up among SEOs and Web Developers. This interview will become a classic.

Targetseo.com

11/30/2005 09:12 am

Thanks Brett and Barry for clearing up these issues. But it's really Jagger news!

Veronica

09/29/2006 11:07 am

The summit would be excellent on search engine robots.

