Will Google Crawl Your Site Without a Robots.txt File? It Depends

Jun 18, 2008 • 7:26 am | comments (4) | Filed Under Google Search Engine Optimization
 

I found a very interesting tidbit from a Google Groups thread on unreachable robots.txt files.

I always believed that a site does not need a robots.txt file. In fact, this site does not have a robots.txt file and yet we are very well indexed. Proof that you don't need a robots.txt file to allow Google to index your site. Right?

Well, maybe not. Googler JohnMu said in the Google Groups thread that if your robots.txt file is unreachable because it times out or has other issues (not including a 404 Not Found status), Google "tends not to crawl the site at all just to be safe."

You hear that? Google might not crawl your site at all if it cannot reach your robots.txt file properly.

In the case in the thread, the robots.txt file was unreachable due to a complex set of redirects that made Googlebot very dizzy.

John explains later on that "unparsable robots.txt" files are "generally" okay, since Google is at least getting back some kind of server response. The trouble comes, generally, when "the URL is just unreachable (perhaps a "security update" that ended up blocking us in general) or situations like this where we give up trying to access the URL (which in a way is unreachable as well)," said John.
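
If you want a rough idea of how your own robots.txt responds before Googlebot trips over it, you can simply fetch it yourself and watch for the failure modes John describes. Here is a minimal sketch in Python using the requests library; the example.com URL and the 10-second timeout are placeholders, and this is only an approximation of the checks a real crawler performs:

    # Rough robots.txt reachability check -- a sketch, not how Googlebot really works.
    # Placeholders: the example.com URL and the 10-second timeout.
    import requests

    URL = "https://www.example.com/robots.txt"

    try:
        resp = requests.get(URL, timeout=10)
        if resp.status_code == 200:
            print("Fetched OK; crawlers will apply these rules:")
            print(resp.text)
        elif resp.status_code == 404:
            # Per the thread, a clean 404 is fine: the site is treated as fully crawlable.
            print("404 Not Found -- no robots.txt, so crawling is allowed.")
        else:
            print("Unexpected status:", resp.status_code)
    except requests.exceptions.TooManyRedirects:
        # The case from the thread: a tangle of redirects that never resolves.
        print("Redirect loop -- Google may skip crawling the site to be safe.")
    except requests.exceptions.Timeout:
        print("Timed out -- an unreachable robots.txt can stall crawling of the whole site.")
    except requests.exceptions.RequestException as exc:
        print("Request failed:", exc)

If the script lands in one of the last three branches, that is the "unreachable" situation John is warning about, as opposed to the harmless missing-file case.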

So, for those picky Belgians, just make your robots.txt file unreachable, and there you go. :)

Forum discussion at Google Groups.


Comments:

Igor The Troll

06/18/2008 12:04 pm

Barry harnessing Googlebot! lol I think it is a wild animal that cannot be corralled! I came across some instances where it even picked up an .htaccess password-protected URL. I have no idea how it sniffed it out, because I do not broadcast it. Maybe my search history? Well, keep trying to tame the mad stallion. If you succeed, do let us know.

sante

06/18/2008 02:54 pm

I have seen this happen on several client sites recently: pages that were scheduled to be spidered were not, because the robots.txt file was not downloadable. In fact, the site was down, so robots.txt wasn't available either. I hadn't witnessed this before...

Rob Abdul

06/21/2008 09:21 am

Igor, chances are that someone with the Google Toolbar with PageRank enabled visited the .htaccess password-protected URL. That tipped off the Google Gods.

Sante, my friend, if a site is down, then it's all over anyhow. Googlebot or any other spider cannot index your site if your site is down, regardless of whether the robots file exists or not. However, Google, MSN, and Yahoo do index your site without a robots.txt file. For SEO reasons it is best practice to have a robots.txt file; I will challenge anyone in the world on this, and I can prove it!

There are times when, due to the client or deadlines, I am forced to upload a new site before it is complete. I make a judgment call as to whether all the pages are optimised, when the site can be crawled, and so on. Based on this I usually block all spiders in the robots file. So in this example I use the existence of the robots file to deliberately block the site from being indexed until it is ready.

The goal is to make the visit by any spider to your site very easy and straightforward. Your card is marked, so to speak, if a spider cannot get a robots file.
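
For reference, the kind of "block everything until the site is ready" robots.txt Rob describes needs only two directives (just remember to loosen or remove it at launch):

    User-agent: *
    Disallow: /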

Mirandawxs Smith

03/28/2011 01:23 pm

Useful information like this one must be kept and maintained so I will put this one on my bookmark list! Thanks for this wonderful post and hoping to post more of this!
