Why Google May Not Find Your Sitemap File

Oct 11, 2010 • 8:53 am | comments (4) | Filed Under Google Search Engine Optimization
 

I'll be honest, I am pretty technical, but I simply do not fully understand the issue here.

Franz Enzenhofer posted a thread at Google Webmaster Help claiming Google cannot access his XML Sitemap no matter what he does. After much debugging and testing, he wrote a blog post explaining why his server, and maybe yours, won't serve up an XML sitemap file to Google.

He wasn't able to see a GET request by Google, so he dug even deeper and noticed an issue at the TCP/IP level. He said, and I quote:

  • You realize that googlebot makes multiple (we counted up to 11) GET requests in one single TCP/IP connection. (which is OK according to the HTTP 1.1 spec).
  • You realize (with the help of stackoverflow) that these multiple GET requests in the same TCP/IP connection are processed in sequence (one after the other).
  • You realize that if one of these GET requests has a major time lag (is much slower than the other GET requests) Google cuts the TCP/IP connection.
  • Because all the GET requests in the connection were processed in sequence, all the GET requests after the cut are lost. You don’t see them in the error/access logs as they were never processed, even though they were sent.
  • You see an error in Google Webmaster Tools, without a trace in your logfiles.
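To make the sequence above easier to follow, here is a minimal Python sketch that simulates it. It is not Googlebot's actual behavior or anyone's real server code: the function name `serve_connection`, the response times, and the 5-second cutoff are all assumptions made up for illustration. It only shows why requests queued behind a slow one would never reach the access log.

```python
def serve_connection(response_times, client_timeout):
    """Simulate several GET requests sent over one TCP/IP connection.

    response_times: seconds a hypothetical server needs per request, in order.
    client_timeout: seconds the client waits for one response before cutting
                    the connection (an assumed value, not a known Google one).

    Returns (completed, cut_at, never_processed):
      completed       - request numbers answered before the cut
      cut_at          - number of the slow request that triggered the cut,
                        or None if the connection was never cut
      never_processed - request numbers sent but never handled by the server
    """
    completed = []
    for number, seconds in enumerate(response_times, start=1):
        if seconds > client_timeout:
            # Requests are handled one after the other, so everything queued
            # behind the slow request is lost when the connection is cut.
            never_processed = list(range(number + 1, len(response_times) + 1))
            return completed, number, never_processed
        completed.append(number)
    return completed, None, []


# Ten requests in one connection; request 5 is much slower than the others.
times = [0.1, 0.1, 0.1, 0.1, 30.0, 0.1, 0.1, 0.1, 0.1, 0.1]
completed, cut_at, never_processed = serve_connection(times, client_timeout=5.0)

print("answered before the cut:", completed)
print("connection cut at request:", cut_at)
print("sent but never in the logs:", never_processed)
```

If the sitemap request happens to be one of the requests after the cut, this would match the symptom described: an error in Google Webmaster Tools with nothing in the server logs.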

Google's Matt Cutts commented on the blog post saying, "Interesting--good find." He didn't necessarily confirm this is a Google bug, but JohnMu commented in the Google Webmaster Help thread, implying that Googlebot stops after some time because of the server's speed. At least I think that is what he is saying:

It looks like you were busy while I was out of town :). Yes, this error can mean that we did not even try to fetch the Sitemap file from your server. If we recently picked up some of the Sitemap files again, I assume this is likely just a coincidence. With your site, we'd likely need to crawl with more than 10 QPS (the maximum manual setting in the crawl rate tool); by changing it to "Let Google determine my crawl rate (recommended)" our systems would be able to do that if they're able to determine that your server can handle it.

Forum discussion at Google Webmaster Help.

 

Comments:

John Hughes

10/11/2010 03:20 pm

If you are having such a problem, would a solution not be to break your sitemap into smaller files, and serve a "sitemap of sitemaps" on the main sitemap URL, which is allowed in the specification? It seems that server response speed, or file download size might be the cause of the issue described.
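The "sitemap of sitemaps" idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (`build_urlset`, `build_index`, `split_sitemap`), the `sitemap-N.xml` file naming, and the example.com URLs are hypothetical. The `urlset` and `sitemapindex` elements and the namespace come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(urls):
    """Return the XML for one regular sitemap file listing the given URLs."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Return the XML for a sitemap index pointing at the smaller sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def split_sitemap(urls, base_url, per_file):
    """Split urls into files of at most per_file entries.

    Returns (index_xml, parts) where parts maps a hypothetical file name
    such as "sitemap-1.xml" to that file's XML.
    """
    parts = {}
    for n, start in enumerate(range(0, len(urls), per_file), start=1):
        parts[f"sitemap-{n}.xml"] = build_urlset(urls[start:start + per_file])
    index_xml = build_index([f"{base_url}/{name}" for name in parts])
    return index_xml, parts


# Example with made-up URLs and a tiny per-file limit to keep the output short.
pages = [f"http://www.example.com/page-{i}" for i in range(1, 6)]
index_xml, parts = split_sitemap(pages, "http://www.example.com", per_file=2)
print(index_xml)
print(sorted(parts))
```

Smaller files should each be quicker to serve, though whether that avoids the problem described in the post is not confirmed here.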

franz

10/11/2010 06:57 pm

Hi ho, Franz here. A good technical discussion can be found here: http://news.ycombinator.com/item?id=1774847

Googlebot on a TCP/IP level, the simple one: Googlebot does not send one GET request after the other. Instead, Googlebot opens one connection and sends multiple GET requests in one go (one TCP/IP connection).

Let's say a package contains 10 GET requests: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The server responds to each of the requests in that package in sequence:

  • 1 ... fast response
  • 2 ... fast response
  • 3 ... fast response
  • 4 ... fast response
  • 5 ... very very slow response

Googlebot cuts the connection. The other requests (6, 7, 8, 9, 10) were not processed; they do not show up in the logfiles of the server. BUT if they were a sitemap XML request, they show up as errors in Google Webmaster Tools, as they were sent but nothing came back.

What does this mean: an overall (all(!!) site sections) fast and reliable site is very important. The outcome is not that big of news, but the investigation was cool.

Amit

10/12/2010 09:59 am

I have read the whole post. Can someone tell me how to check that Google is accessing my XML sitemap? Is it by using the Google Webmaster Central tool?
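One way to check from the server side is to look for Googlebot requests to the sitemap in the access log. The sketch below assumes an Apache-style "combined" log format and a sitemap at `/sitemap.xml`; the function name `googlebot_sitemap_hits` and the sample log lines are made up for illustration. Keep in mind that a user-agent string can be faked, so this is a rough check only.

```python
import re

# Pattern for one line of an Apache-style "combined" access log (assumed format).
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_sitemap_hits(lines, sitemap_path="/sitemap.xml"):
    """Return (timestamp, status) for each Googlebot request to the sitemap."""
    hits = []
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if match["path"] == sitemap_path and "Googlebot" in match["agent"]:
            hits.append((match["time"], int(match["status"])))
    return hits


# Made-up sample lines; in practice, read them from your own access log file.
sample = [
    '66.249.66.1 - - [11/Oct/2010:08:53:00 +0000] "GET /sitemap.xml HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [11/Oct/2010:08:54:10 +0000] "GET /sitemap.xml HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '66.249.66.1 - - [11/Oct/2010:08:55:42 +0000] "GET /index.html HTTP/1.1" '
    '200 1043 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for when, status in googlebot_sitemap_hits(sample):
    print(when, status)
```

An empty result while Google Webmaster Tools reports a sitemap error would fit the situation described in the post: the request was sent but never processed by the server.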

