Found a nice thread at Cre8asite Forums that is good for anyone to catch up on. It concerns the indexing limits Google has for certain websites and pages. When does Google decide to stop spidering all your pages? Why does it just grab the most important pages first?
I've had good experience getting large numbers of pages spidered. In my opinion, discovery date and how effectively pages are linked, both internally and externally, are a large part of what assigns importance to certain pages and keeps Googlebot coming back for more daily.
The original poster of the thread is trying to understand why Google has spidered only 20% of his 800 pages. The first clue from the thread is his comment: "we only seem to have about 100 pages indexed." Okay, that's a start.
Next clue: "We have added about 2000 pages recently and changed the menu". Hmmm... that would probably have something to do with it.
And finally: "the only theory i can think of is the amount of links on a page as the menu alone is about 100 links"
Well, given that Google has spidered only 100 pages, and there are only 100 links in the navigation menu, I would say that Google is not having trouble listing any of the pages; it just can't find them all! This comes down to an information architecture problem relating specifically to the menu organization, and IMO some beliefs that might be keeping the webmaster from fully utilizing the navigation. The first myth to bust is that you can have more than 100 links in a navigation menu and get by just fine. The prevailing thought for a long time was that Google would only spider the first 100 links, and any more risked a penalty. Not true anymore; times have changed. However, there are still inherent problems with more than 100 links, such as page size, which can cap the number of spiderable links, and so on.
To understand a bit more about how Google spiders pages and which ones it favors most, Cre8asiteforums admin bragadocchio posted some excerpts from the paper Efficient Crawling Through URL Ordering:
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
Bragadocchio goes on to unpack this a little more: "Importance metrics, like those defined in the paper, can be combined, so on a site that has a number of pages with higher pageranks, or more inbound links, those might help combat the weakness of a page like that when it comes to an importance metric based upon location and distance from the root directory."
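The core idea of the paper, visiting the URLs you already know about in order of an importance score rather than first-come-first-served, can be sketched in a few lines. Here is a toy illustration (not Google's actual implementation) that uses known in-link count as the importance metric and a priority-queue frontier; the site graph and page names are made up for the example:

```python
import heapq

def crawl_in_importance_order(graph, root):
    """Visit reachable pages in order of an importance metric: here,
    the number of known in-links. `graph` maps each page to the list
    of pages it links to. Pages with more in-links are fetched first,
    a toy version of the 'ordering scheme' idea from the paper."""
    # Count in-links for every page as a stand-in importance metric.
    inlinks = {}
    for targets in graph.values():
        for t in targets:
            inlinks[t] = inlinks.get(t, 0) + 1

    visited = []
    seen = {root}
    # heapq is a min-heap, so negate the score to pop the most
    # important (highest in-link count) page first.
    frontier = [(-inlinks.get(root, 0), root)]
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-inlinks.get(link, 0), link))
    return visited

# A heavily linked page gets crawled before a weakly linked one,
# even though both were discovered at the same time:
site = {"/": ["/popular", "/obscure"], "/popular": [], "/obscure": [],
        "/other": ["/popular"]}
print(crawl_in_importance_order(site, "/"))
```

This is exactly why pages buried deep in a site with few internal links pointing at them tend to be crawled last, or not at all, when the crawler's budget for the site runs out.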
Excellent thread. For continued discussion about Google indexing limits, visit Cre8asite Forums.