What SEO Geeks Argue About

Feb 13, 2012 • 8:59 am | comments (21) | Filed Under SEO - Search Engine Optimization
 

This morning, I spotted a group of SEOs arguing about whether a URL can be "indexed" if the page is blocked by a noindex tag or robots.txt file.

It is a valid SEO question but when you ask that in a room of SEO geeks, the responses can get pretty wild.

You've all seen examples of URLs in the search results that list just the URL but not the actual title tag and snippet of the page. That is typically because Google has a copy of the URL in its database as a reference but has not crawled or indexed the content on the web page, because it is restricted from doing so for one reason or another.

The question is, is that URL considered indexed or not? That depends on the definition of "indexed" and which SEO you ask.

Let me share the tweets about this:

The discussion went on for dozens and dozens of tweets with no one winning.

Matt or John, want to chime in and give the final answer?

Forum discussion at, um, Twitter.


Comments:

m_j_taylor

02/13/2012 02:07 pm

I've seen pages that were ostensibly blocked that were cached ... which means they must be crawled, but who am I to argue? 

rjonesx

02/13/2012 02:12 pm

Discussions like this make me want to punch everyone in the face. Including myself.

rjonesx

02/13/2012 02:16 pm

Google's robots.txt handling is imperfect (so is every bot's). For example, they don't re-check robots.txt every time they crawl a URL on your site. Instead, they check robots.txt with some level of frequency (probably a mix of how often you have updated it in the past, the discovery of new directories, the number of URLs queued, etc.), and then apply that copy to the URLs they have queued to spider. If the robots.txt changes between checks, they will still spider and index/cache those URLs.
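To illustrate the caching behavior described in this comment, here is a minimal Python sketch. It is not how Googlebot works internally; the `CachedRobots` class, the 24-hour `ttl`, the `examplebot` agent name and the example.com URLs are all assumptions made for illustration. Only `urllib.robotparser.RobotFileParser` comes from the standard library.

```python
from urllib.robotparser import RobotFileParser


class CachedRobots:
    """Hypothetical crawler helper: re-reads robots.txt only when the
    cached copy is older than `ttl` seconds."""

    def __init__(self, fetch, ttl=86400):
        self.fetch = fetch        # callable returning the live robots.txt text
        self.ttl = ttl            # assumed refresh interval (24 hours)
        self.fetched_at = None
        self.parser = None

    def allowed(self, url, now, agent="examplebot"):
        # Refresh the cached rules only if they are missing or stale.
        if self.fetched_at is None or now - self.fetched_at >= self.ttl:
            self.parser = RobotFileParser()
            self.parser.parse(self.fetch().splitlines())
            self.fetched_at = now
        return self.parser.can_fetch(agent, url)


# The site owner's live file; it starts out allowing everything.
live = {"text": "User-agent: *\nAllow: /\n"}
robots = CachedRobots(lambda: live["text"])
url = "http://example.com/private/a.html"

first = robots.allowed(url, now=0)        # rules fetched and cached
live["text"] = "User-agent: *\nDisallow: /private/\n"
stale = robots.allowed(url, now=3600)     # one hour later: old rules still used
fresh = robots.allowed(url, now=90000)    # past the ttl: new rules picked up
print(first, stale, fresh)
```

In this sketch the second check still reports the URL as crawlable, because the disallow rule was added after the rules were cached. That is the window in which a newly blocked page could still be spidered.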

AndyBeard

02/13/2012 02:16 pm

Here Matt used the terminology "Uncrawled reference": http://www.stonetemple.com/articles/interview-matt-cutts.shtml Just to confuse matters, John Mu does say "for indexing if we don't crawl": https://plus.google.com/107576957488923607021/posts/PCqaNsmsabz I think what Matt says is much cleaner. Why? Because then if a URL is "indexed" it means it has been crawled, and then a meta "noindex" or X-Robots-Tag noindex can remove it.

Autocrat (Lyndon NA)

02/13/2012 03:00 pm

robots.txt is generally cached for 24+ hours. G will check their last reference rather than constantly checking it. If you want to ensure something is Not indexed - you have to let G crawl it. That means robots.txt is of no real use to prevent indexation. If it shows in the SERPs - it's in G's index. This is regardless of whether the actual file has been crawled or not. G may pick up the URL, it may utilise inbound link text, it may utilise content from linking sources etc. ... but it will not use the content from the page as it was not crawled. If G actually took a directive in robots.txt to mean "do not index" as well as "do not crawl", this wouldn't be a problem! Question - how is it that in this day/age, after so many years of this sort of behaviour from G, it is still an "issue"?

Oli

02/13/2012 04:03 pm

If a reference link has not been crawled it has not been indexed, but it can still appear in the results. Chances are though you will never see one, why?  Google has zero content for it. A lot of the argument here is simply over syntax.

Jeff Downer Indianapolis IN

02/13/2012 04:41 pm

I always assume Google crawls everything, then sorts it out later. Seems safer to think that way to me.

SEO Catalysts

02/13/2012 04:46 pm

The discussion is very long and gave me a headache... I can't understand what they are trying to say in the end.

John Britsios

02/13/2012 05:22 pm

Here is a video with Matt Cutts where he explains clearly what the case is: http://www.youtube.com/watch?v=KBdEwpRQRD0

Ann Smarty

02/13/2012 08:17 pm

I still don't agree with that, Andy, sorry: "Because then if a URL is "indexed" it means it has been crawled" I am absolutely fine with calling it referencing but I won't agree that the URL should be crawled to be indexed...

Ann Smarty

02/13/2012 08:22 pm

This is GOLD: "If you want to ensure something is NOT indexed - you have to let G crawl it. That means robots.txt is of no real use to prevent indexation." I remember an issue discussed a few years ago: a URL was "referenced" in search results despite a NOINDEX meta tag. It turned out that Google COULD NOT see the meta tag because the robots.txt file prevented it from crawling the page to see it! This is why "crawling" and "indexing" should not be mixed up. And this is why this whole discussion MAKES perfect sense! Thanks again for bringing it up, Barry!
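The ordering problem in this comment can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of Google's pipeline: the `listing_for` function and its three result labels are hypothetical, and the only real API used is the standard library's `html.parser.HTMLParser`.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Looks for <meta name="robots" content="...noindex..."> in fetched HTML."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True


def listing_for(blocked_by_robots, html):
    """Hypothetical model of how a known URL ends up listed."""
    if blocked_by_robots:
        # The page is never fetched, so a noindex tag inside it is never read.
        return "url-only reference"
    meta = RobotsMeta()
    meta.feed(html)
    return "not listed" if meta.noindex else "full listing"


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(listing_for(True, page))    # robots.txt block hides the noindex tag
print(listing_for(False, page))   # crawl allowed, so the noindex is honored
```

The robots.txt check happens before the fetch and the meta tag can only be read after it, so in this model a blocked page with a noindex tag still ends up as a URL-only reference.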

Lapworth Architects

02/13/2012 08:46 pm

The video by Matt Cutts is very useful. Almost a diametrical opposite; very, very confusing for a novice SEO!

Autocrat (Lyndon NA)

02/13/2012 08:50 pm

This was a somewhat common issue of confusion in the Google Webmaster Forums ... it came up a few times most months, and took ages to get through to people. In the end, I made an Auto-Response to cover it to save typing/repeating myself. Please excuse the typos, spellings, grammar and sarcasm (unlike some, I'm not a writer :D): http://www.google.com/support/forum/p/Webmasters/thread?tid=6ca03e52daa66fc3&hl=en If only more people grasped the little differences - it would solve so much confusion.

Falko Luedtke

02/13/2012 09:41 pm

Okay, I will jump into this because it is one of my favorite topics. Let's look at the terms:

Crawling = A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or, especially in the FOAF community, Web scutters. http://en.wikipedia.org/wiki/Web_crawler

Indexing = Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process in the context of search engines designed to find web pages on the Internet is web indexing. http://en.wikipedia.org/wiki/Index_(search_engine)

Googlebot is described in some detail, but the reference is only about an early version of its architecture, which was based in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done for full-text indexing and also for URL extraction. There is a URL server that sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked if the URL had been previously seen. If not, the URL was added to the queue of the URL server.

On Google: the crawler was integrated with the indexing process, because text parsing was done for full-text indexing and also for URL extraction. Which means that when a page gets crawled, it gets indexed into the DB because it was read by the bot, i.e. crawled. At this point the algorithms take over and check for "signs" that tell Google not to show a page in the search results, and the page is not made available in the "Search Result Index", e.g. the Supplemental Index or Main Index. So actually everybody is kind of right; it just depends on the vantage point you are looking from :) Cheers Falko

Tony McCreath (Tiggerito)

02/14/2012 12:24 am

Everyone seems to have their own understanding of what INDEX means, so it's not surprising we have confusion! So I'll add my take :-) Coming from a software background, I would consider that a URL is indexed as soon as it has entered the system, i.e. as soon as a link to a URL was found. That index entry would change state as more information is discovered, like when the URL is later crawled. One flag would be to exclude the entry from search results (meta robots), another may be to prohibit crawling of the URL in question (robots.txt). A very basic index entry with only the URL can be called a reference.
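The record-with-flags model in this comment can be written down as a small Python data class. This is a sketch of the commenter's mental model only; the `IndexEntry` name, its fields and the three state labels are hypothetical and do not reflect any real search engine's schema.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Hypothetical index record created as soon as a link to the URL is found."""
    url: str
    crawl_blocked: bool = False   # set from robots.txt
    crawled: bool = False
    noindex: bool = False         # set from meta robots, only knowable after a crawl

    def crawl(self, page_has_noindex):
        """Try to crawl the URL; returns False if robots.txt forbids it."""
        if self.crawl_blocked:
            return False
        self.crawled = True
        self.noindex = page_has_noindex
        return True

    def state(self):
        if self.crawled and self.noindex:
            return "excluded"     # crawled, then dropped from results
        if self.crawled:
            return "crawled"      # full entry with title and snippet
        return "reference"        # only the URL is known


entry = IndexEntry("http://example.com/a")
print(entry.state())              # a bare reference until it is crawled
entry.crawl(page_has_noindex=False)
print(entry.state())

blocked = IndexEntry("http://example.com/b", crawl_blocked=True)
blocked.crawl(page_has_noindex=True)
print(blocked.state())            # stays a reference: the noindex was never seen
```

Under this model, "indexed" is a matter of which state you count: the entry exists from the moment the link is discovered, and the two flags only change how, or whether, it is shown.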

Dicebat

02/14/2012 01:02 pm

No brainer. If it's indexed without displaying a proper snippet, it's indexed but not crawled. If it displays meta info or Google generated info in the snippet, it's been crawled and indexed.

Matt Bennett

02/14/2012 04:10 pm

I pulled out of these sort of discussions a few years ago because of the never ending aspect they have. It's basically a discussion about semantics!

Matt Bennett

02/14/2012 04:35 pm

http://www.youtube.com/watch?v=J2oEmPP5dTM :-)

Aashish Sahrawat

02/15/2012 09:17 am

Yes, a URL can be indexed although it is blocked by robots or by other means. This happens when that URL is placed on other websites and G discovers that URL through another website.

Neil Lance Fessler

04/10/2012 06:56 am

For me, indexed = first crawl, and crawling is the continuous checking by bots for updates. The cache is involved too (the snapshot from when the last crawl took place) :D

Spook SEO

05/18/2014 05:23 pm

Hello Barry! I believe that if a URL is indexed then it has been crawled. I think everyone has their own understanding about this certain topic. It is also nice to hear the opinion of other SEO experts.
