During the Stone Temple Consulting Q&A with Gary Illyes on YouTube the other week, he spoke about crawl budget and how Google doesn't necessarily crawl based on a "crawl budget." He said that internally, Google's term for what a server can handle is "host load," while a separate scheduling process builds a bucket of URLs in importance order. Googlebot then crawls in that order based on the schedule: if Google thinks your server can handle it, it will crawl the whole bucket; if not, it will stop.
He was asked a question on crawl budget at the 17:25 mark by Eric Enge, who asked, "Historically, people have talked about Google having a crawl budget. Is that a correct notion, like Google comes in and they're going to take 327 pages from your site today?"
Gary Illyes responded, and it was hard to transcribe:
It's not like, how many pages do we want to crawl. We have a notion internally which we call host load. And that's basically how much we think the site can handle from us. But it's not based on a number of pages. It's more like, what's the limit or threshold after which the server becomes slower, for example, or stuff like that.
I think what you are talking about is actually scheduling. Basically, how many pages do we ask from the indexing side to be crawled by Googlebot. That is driven mainly by the importance of the pages on that site, not by the number of URLs or how many URLs you want to crawl. It doesn't have anything to do with host load. It's more like, this is just an example, but if a URL lives in a sitemap, then we will probably want to crawl it, or crawl it sooner or more often, because you deem that page more important by putting it in a sitemap. We can also learn that this might not be true, when sitemaps are automatically generated and for every single URL there is a URL in the sitemap, and then we will use other signals. For example, high PageRank URLs probably should be crawled more often, and we have a bunch of other signals that we use. But basically, the more important the URL is, the more often it will be recrawled. And once it is recrawled, the bucket of URLs of high importance URLs…
Every day, but it's probably not a day, we create a bucket of URLs that we want to crawl from a site, and we fill the bucket with URLs sorted by the signals that we use for scheduling, sitemaps or PageRank, whatever, and then from the top, we start crawling and crawling and crawling. If the server can handle it, we can finish the bucket, fine. If we see that the server slows down, we will not.
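To make the scheduling idea concrete, here is a minimal sketch in Python of what Gary describes: build a bucket of URLs sorted by importance signals, crawl from the top, and stop early if the server slows down. The function names, the signals (`in_sitemap`, `pagerank`), and the scoring are all hypothetical illustrations, not Google's actual implementation.

```python
def build_bucket(urls):
    """Sort candidate URLs by a combined importance score, highest first.

    `urls` is a list of dicts carrying hypothetical signals such as
    `in_sitemap` and `pagerank` (stand-ins for the "bunch of signals"
    Gary mentions).
    """
    def importance(u):
        score = u.get("pagerank", 0.0)
        if u.get("in_sitemap"):
            # Being listed in a sitemap suggests the site owner deems
            # the page important, so bump its priority.
            score += 1.0
        return score
    return sorted(urls, key=importance, reverse=True)


def crawl_bucket(bucket, fetch, server_is_slow):
    """Crawl from the top of the bucket; stop if the host seems overloaded."""
    crawled = []
    for entry in bucket:
        if server_is_slow():
            break  # the "host load" check: don't finish the bucket
        fetch(entry["url"])
        crawled.append(entry["url"])
    return crawled
```

So a sitemap URL with high PageRank would be scheduled before an orphan URL with no signals, and a slowdown mid-bucket would simply leave the tail of the bucket uncrawled until the next cycle.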
I recommend you listen to it yourself; I am embedding it at the start time:
I did find one document from Google on host load related to their search appliance, it says:
The Web Server Host Load value specifies the maximum number of concurrent connections opened for crawling between the search appliance and each web server during any one-minute period. The default number of concurrent connections is four. Google recommends that you start four connections, then increase the value after you determine that your web or file servers have sufficient capacity for a higher load. Consult the administrator whose sites the search appliance crawls to determine a server's load capacity.
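The mechanism the search appliance docs describe, capping concurrent connections to any one web server, can be sketched with a simple semaphore. The limit of four matches the documented default; the class name and structure are my own illustration, not the appliance's actual code.

```python
import threading


class HostLoadLimiter:
    """Cap the number of simultaneous fetches against one web server."""

    def __init__(self, max_concurrent=4):
        # Semaphore permits mirror the "Web Server Host Load" value.
        self._sem = threading.Semaphore(max_concurrent)

    def fetch(self, url, do_request):
        # Blocks whenever max_concurrent requests are already in flight,
        # so the crawler never exceeds the configured host load.
        with self._sem:
            return do_request(url)
```

A crawler would share one limiter per host across its worker threads, raising `max_concurrent` only after confirming the server can take the extra load, as the documentation advises.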
It goes on and on, so it is an interesting read.
Forum discussion at YouTube.