Google's Gary Illyes On Crawl Budget, Scheduling & Host Load

May 17, 2016 - 8:47 am

During the Stone Temple Consulting Q&A with Gary Illyes on YouTube the other week, he spoke about crawl budget and how Google doesn't necessarily crawl based on a crawl budget. He said that internally, Google has a notion it calls "host load," which is how much crawling Google thinks your server can handle, and that what people usually mean by crawl budget is really scheduling. Scheduling fills a bucket with URLs in importance order, and GoogleBot crawls them in that order, starting from the top. If Google thinks your server can handle it, it will crawl the whole bucket; if the server slows down, it will stop.

He was asked a question on crawl budget at the 17:25 mark by Eric Enge, who asked, "Historically, people have talked about Google having a crawl budget. Is that a correct notion, like Google comes in, they're going to take 327 pages from your site today?"

Gary Illyes responded and it was hard to transcribe...

It's not like, like, how many pages do we want to crawl. We have a notion internally which we call host load. And that's basically how much do we think a site can handle from us. But it's not based on a number of pages. It's more like, what's the limit, or what's a threshold where [the server is] down, or after which the server becomes slower, for example, or stuff like that.

I think what you are talking about is actually scheduling. Basically, how many pages do we ask from the indexing side to be crawled by Googlebot. That is driven mainly by the importance of the pages on a site, but not by the number of URLs or how many URLs you want to crawl. It doesn't have anything to do with host load. It is more like, this is just an example, but for example, this URL lives in a sitemap, then we will probably want to crawl it, or crawl it sooner or more often, because you deem that page more important by putting it in a sitemap. We can also learn that this might not be true when sitemaps are automatically generated and, like, for every single URL there is a URL in the sitemap, and then we will use other signals. For example, high PageRank URLs probably should be crawled more often, and we have a bunch of other signals that we use. But basically, the more important the URL is, the more often it will be recrawled. And once it is recrawled, the bucket of URLs of high importance URLs…

Every day, but it's probably not a day, we create a bucket of URLs that we want to crawl from a site, and we fill the bucket with URLs sorted by the signals that we use for scheduling, sitemaps or PageRank or whatever, and then from the top, we start crawling and crawling and crawling. Then if we see that the server... if we can finish the bucket, fine. If we see that the server slowed down, then we will not.
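To make the bucket idea concrete, here is a rough Python sketch of how I read Gary's description. It is not Google's code. The function name `crawl_bucket`, the numeric importance scores, the `fetch_seconds` callback and the `slow_threshold` value are all my own assumptions for illustration; Gary did not give any numbers or say how a slowdown is detected.

```python
from typing import Callable, Iterable, List, Tuple


def crawl_bucket(
    urls_with_scores: Iterable[Tuple[str, float]],
    fetch_seconds: Callable[[str], float],
    slow_threshold: float = 2.0,
) -> List[str]:
    """Crawl URLs from most to least important, stopping if the host slows down.

    urls_with_scores: (url, importance) pairs; the score stands in for
        scheduling signals such as sitemaps or PageRank (assumed numeric here).
    fetch_seconds: hypothetical callback that fetches a URL and returns the
        response time in seconds.
    slow_threshold: assumed response time above which the server counts as slow.
    """
    # Build the "bucket": URLs sorted by importance, highest first.
    bucket = sorted(urls_with_scores, key=lambda pair: pair[1], reverse=True)

    crawled = []
    for url, _score in bucket:
        elapsed = fetch_seconds(url)
        crawled.append(url)
        # If the server looks slow, stop before finishing the bucket.
        if elapsed > slow_threshold:
            break
    return crawled
```

The point of the sketch is that nothing in it counts pages. Importance only decides the order, and the server's response decides where the crawl stops.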

I recommend you listen to it yourself; I am embedding it at the start time:

I did find one document from Google on host load related to their search appliance. It says:

The Web Server Host Load value specifies the maximum number of concurrent connections opened for crawling between the search appliance and each web server during any one-minute period. The default number of concurrent connections is four. Google recommends that you start four connections, then increase the value after you determine that your web or file servers have sufficient capacity for a higher load. Consult the administrator whose sites the search appliance crawls to determine a server's load capacity.
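As a loose illustration of that kind of setting, the Python sketch below caps how many fetches run against one server at the same time. It is not how the search appliance is implemented. The name `crawl_with_host_load` and the `fetch` callback are hypothetical, and the sketch only models the concurrent-connection cap (defaulting to four, as in the quote), not the one-minute period.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def crawl_with_host_load(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    host_load: int = 4,
) -> List[str]:
    """Fetch URLs from one host with at most `host_load` requests in flight.

    fetch: hypothetical callback that downloads one URL and returns its body.
    host_load: maximum number of concurrent connections (assumed default of 4).
    """
    # The pool size is the cap: no more than host_load fetches run at once.
    with ThreadPoolExecutor(max_workers=host_load) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch, urls))
```

Raising `host_load` only makes sense once you know the server has the capacity for it, which is the same advice the document gives.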

It goes on and on, so it is an interesting read.

Forum discussion at YouTube.

 
