Blocking Your Robots.txt In Google's Search Results

Dec 27, 2012 • 8:42 am | Filed Under Google Search Engine Optimization

Have you ever seen your robots.txt file return a match in the Google search results before? Does that upset you? Did you know you can block Google from displaying that file in the search results without blocking Google's access to that file?

This is not a new topic. In 2008, we polled our readers on whether Google should display the robots.txt file in search results, and the overwhelming majority said no, Google should not show the content in the search results.

But here is a picture of the robots.txt file showing up in the search results for this site:

[Image: robots.txt in Google results]

There are, however, ways to block the robots.txt file from showing up in the search results. John Mueller of Google explained them about five years ago:

Hi guys, there are two ways to block your robots.txt from showing up in search results:

- disallow it in your robots.txt (don't worry, we'll still check it); you can then use the Webmaster Tools URL removal tool to have it taken out of the index if it's indexed.

- use the x-robots-tag HTTP header tag with "noindex"

On the other hand, robots.txt URLs generally would not show up in any search results where you have more relevant pages within your site, so this is probably not something you'd want to spend all too much effort on :-).
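
To make the second option concrete, here is a minimal Python sketch that checks whether a set of X-Robots-Tag header values asks crawlers not to index a file. The function name `blocks_indexing` is hypothetical, and the parsing is deliberately simplified: it treats "none" as equivalent to "noindex", strips an optional crawler prefix such as "googlebot:", and does not try to work out which crawler a scoped rule applies to.

```python
# Directives that tell a crawler not to index the response.
# Assumption: "none" is treated as shorthand for "noindex, nofollow".
INDEX_BLOCKING = {"noindex", "none"}


def blocks_indexing(header_values):
    """Return True if any X-Robots-Tag value contains an index-blocking directive.

    header_values is a list of raw header strings, one per X-Robots-Tag header.
    Simplification: a crawler prefix ("googlebot: noindex") is stripped and
    the rule is treated as if it applied to every crawler.
    """
    for value in header_values:
        for token in value.lower().split(","):
            token = token.strip()
            # Split off an optional "name:" prefix; keep the part after it.
            _, sep, rest = token.partition(":")
            directive = rest.strip() if sep else token
            if directive in INDEX_BLOCKING:
                return True
    return False
```

The header values could come from any HTTP client. With the standard library, for example, `response.headers.get_all("X-Robots-Tag") or []` on a `urllib.request.urlopen` response yields a list in the shape this function expects.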

Forum discussion at StackOverflow and also that old WebmasterWorld thread.


12/27/2012 05:34 pm

Yeah, it seems counter-intuitive to add Disallow: /robots.txt to your robots.txt file, but it works. The robots.txt directive stops it being fetched for indexing purposes but does not stop it being fetched to find out what folders and files are allowed and not allowed to be indexed on your site.
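
As a rough illustration of the rule described in this comment, the sketch below feeds a robots.txt containing Disallow: /robots.txt to Python's standard-library parser. The domain example.com and the variable names are placeholders. This only shows how a generic parser matches the rule against the file's own URL; it says nothing about how Googlebot itself treats the file, which is what the comment above describes.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that disallows itself, as suggested in the comment above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /robots.txt
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

# The rule matches the robots.txt URL itself...
robots_allowed = parser.can_fetch("Googlebot", "https://example.com/robots.txt")
# ...while other paths are unaffected.
about_allowed = parser.can_fetch("Googlebot", "https://example.com/about")

print("robots.txt allowed:", robots_allowed)
print("/about allowed:", about_allowed)
```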

John Britsios

12/27/2012 06:17 pm

Google has always been able to index robots.txt, and the file can also accumulate PageRank. I had this discussion at the WebProWorld forums back in 2007. At least to prevent indexing of my robots.txt, sitemap.xml, and other files, I have for years included the X-Robots rules in my .htaccess:

Header set X-Robots-Tag "noindex,noarchive,nosnippet,follow"

Attention: The X-Robots is not correctly displayed in the browser. The last 3 characters (="") should be included. I am sure you know what I mean.

Adding a Disallow: /robots.txt to the robots.txt does not make any sense to me. g1smd, can you explain what that should do?
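
For sites that are not served through Apache, the same idea can be sketched in Python as a small WSGI middleware. This is an illustration under assumptions, not the commenter's setup: the names `add_x_robots_tag`, `demo_app`, and `response_headers` are hypothetical, and the list of paths assumes the header is wanted only for robots.txt and sitemap.xml.

```python
# Paths that should carry the header (assumption: only these two files).
NOINDEX_PATHS = {"/robots.txt", "/sitemap.xml"}
# Header value taken from the .htaccess line quoted in the comment above.
X_ROBOTS_VALUE = "noindex,noarchive,nosnippet,follow"


def add_x_robots_tag(app):
    """Wrap a WSGI app so that selected paths get an X-Robots-Tag header."""
    def middleware(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            if environ.get("PATH_INFO") in NOINDEX_PATHS:
                headers = list(headers) + [("X-Robots-Tag", X_ROBOTS_VALUE)]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in application that returns a plain-text body for any path."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"User-agent: *\nDisallow:\n"]


def response_headers(app, path):
    """Call a WSGI app directly (no server) and return its response headers."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    body = app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response)
    list(body)  # consume the body so the app runs to completion
    return captured["headers"]


wrapped = add_x_robots_tag(demo_app)
print(response_headers(wrapped, "/robots.txt"))
print(response_headers(wrapped, "/index.html"))
```

The files stay fully crawlable in this setup; only the response header changes, which is the distinction the .htaccess approach relies on.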

Phillip Marquez

12/28/2012 01:49 am

You know that Google has suggested allowing their crawlers access to your CSS and JS files now, right? Edit: Never mind, I just went over your .htaccess example and see you're not blocking access, just no-index/archive/snippet. :)

John Britsios

12/28/2012 01:55 am

Very well done Phillip. ;)

Zach Kasperski

12/28/2012 04:22 am

Dang. This hasn't really ever crossed my mind, but I definitely learned something. Thanks! And for anyone who doesn't know what a /robots.txt file does or how to check one:

Search engines use robots (also called spiders or crawlers) to index pages on the Internet. Google calls its robot “Googlebot,” and its job is to visit every single web page on the Internet and collect information about each page. Website owners use the robots.txt file to give instructions about their website to robots; this is called the “Robots Exclusion Protocol.”

CHECKING FOR ROBOTS.TXT

1. Go to your website’s home page.
2. Enter “/robots.txt” following your home page’s URL. It should look something like this:
3. If you have a /robots.txt file, you'll see a page of black text with lines such as: Disallow: /SomeWebPage. If you don’t have one, you’ll most likely end up on a 404 error page.
4. Check whether any valuable or authoritative content is blocked from crawling, e.g. a Disallow rule for “/” or “/about”.
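
The manual check above can be approximated in a few lines of Python. This is a simplified sketch: `robots_url` and `disallowed_paths` are hypothetical helper names, example.com is a placeholder, and the parser ignores User-agent grouping and simply lists every non-empty Disallow value it finds.

```python
from urllib.parse import urljoin


def robots_url(home_page_url):
    """Build the robots.txt URL for the site that home_page_url belongs to."""
    return urljoin(home_page_url, "/robots.txt")


def disallowed_paths(robots_txt):
    """Return every non-empty Disallow value found in robots.txt text.

    Simplification: User-agent groups are ignored, so rules for all
    crawlers are mixed together in the result.
    """
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value".
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        value = value.strip()
        # An empty Disallow value blocks nothing, so it is skipped.
        if sep and field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


sample = "User-agent: *\nDisallow: /about\nDisallow: /  # blocks everything\nAllow: /public\n"
print(robots_url("https://example.com/blog/post"))
print(disallowed_paths(sample))
```

A result that contains “/” or a path to an important page is the situation step 4 warns about.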

André DeMarre

12/31/2012 07:14 pm

Any ideas on blocking sitemap.xml from search results?

Oleg Korneitchouk

01/02/2013 08:05 pm

Crawlers will always read and follow instructions contained in your robots.txt file (unless they are rogue). Adding the Disallow: /robots.txt prevents it from being indexed rather than blocking access to the file entirely. I prefer your .htaccess code though. Cheers
