Gary Illyes from Google described how search engine crawlers have changed over the years. This came up in the latest Search Off the Record podcast, where he spoke with Martin Splitt.
He also said that while Googlebot does not support HTTP/3 yet, it eventually will because it is more efficient.
Crawling has changed in a few ways, including:
(1) The shift across HTTP versions, from the header-less days of HTTP/0.9 to HTTP/1.1 and now HTTP/2 and HTTP/3
(2) The robots.txt protocol (although that is super super old)
(3) Dealing with spammers and scammers
(4) How AI products are consuming more content now (kinda).
This came up at the 23:23 mark of the podcast; here is the embed:
Martin Splitt asked Gary: "Do you see a change in the way that crawlers work or behave over the years?"
Gary replied:
Behave, yes. How they crawl, there's probably not that much to change. Well, I guess back in the days we had, what, HTTP/1.1, or probably they were not crawling on /0.9 because no headers and stuff, like that's probably hard. But, anyway, nowadays you have h2/h3. I mean, we don't support h3 at the moment, but eventually, why wouldn't we? And that enables crawling much more efficiently because you can stream stuff--stream, meaning that you open one connection and then you just do multiple things on that one connection instead of opening a bunch of connections. So like the way the HTTP clients work under the hood, that changes, but technically crawling doesn't actually change.
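To make Gary's "one connection, multiple things" point a bit more concrete, here is a minimal sketch of HTTP/2-style multiplexed fetching. This is purely illustrative, not anything Google has published about Googlebot; it assumes the Python httpx library and uses placeholder URLs.

```python
# Illustrative only: several requests reused over one HTTP/2 connection,
# instead of opening a new connection per URL.
# Requires: pip install "httpx[http2]"
import httpx

# Placeholder URLs, not real crawl targets.
urls = [
    "https://example.com/",
    "https://example.com/page-1",
    "https://example.com/page-2",
]

# http2=True lets the client negotiate HTTP/2, so the requests below
# can be multiplexed over a single connection.
with httpx.Client(http2=True) as client:
    for url in urls:
        response = client.get(url)
        print(url, response.status_code, response.http_version)
```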
He then added:
And then how different companies set policies for their crawlers, that of course differs greatly. If you are involved in discussions at the IETF, for example, the Internet Engineering Task Force, about crawler behavior, then you can see that some publishers are complaining that crawler X or crawler B or crawler Y was doing something that they would have considered not nice. The policies might differ between crawler operators, but in general, I think the well-behaved crawlers, they would all try to honor robots.txt, or Robots Exclusion Protocol, in general, and pay some attention to the signals that sites give about their own load or their servers load and back out when they can. And then you also have, what are they called, the adversarial crawlers like malware scanners and privacy scanners and whatnot. And then you would probably need a different kind of policy for them because they are doing something that they want to hide. Not for a malicious reason, but because malware distributors would probably try to hide their malware if they knew that a malware scanner is coming in, let's say. I was trying to come up with another example, but I can't. Anyway. Yeah. What else do you have?
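For anyone curious what "honoring robots.txt" looks like in practice, here is a quick sketch using Python's built-in robots.txt parser. The user agent and URLs are made up for illustration; this is not how any particular crawler is implemented.

```python
# Illustrative only: check the Robots Exclusion Protocol before fetching a URL.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder site
parser.read()  # fetch and parse the site's robots.txt

user_agent = "ExampleCrawler"  # hypothetical crawler name
for url in ["https://example.com/", "https://example.com/private/page"]:
    if parser.can_fetch(user_agent, url):
        print("allowed:", url)
    else:
        print("disallowed by robots.txt:", url)
```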
He added later:
Yeah. I mean, that's one thing that we've been doing last year, right? Like, we were trying to reduce our footprint on the internet. Of course, it's not helping that then new products are launching or new AI products that do fetching for various reasons. And then basically you saved seven bytes from each request that you make. And then this new product will add back eight. The internet can handle the load from crawlers. I firmly believe that--this will be controversial and I will get yelled at on the internet for this--but it's not crawling that is eating up the resources; it's indexing and potentially serving or what you are doing with the data when you are processing that data that you fetch, that's what's expensive and resource-intensive. Yeah, I will stop there before I get in more trouble.
I mean, not much has changed, but listening to this wasn't too bad (looking at you, Gary).
Forum discussion at LinkedIn.
Image credit to Lizzi