Myriam Jessier asked Google what the good attributes of a web crawler would be, and both Martin Splitt and Gary Illyes chimed in with responses.
Myriam Jessier asked on Bluesky, "what are the good attributes one should look into when picking a crawler to check things on a site for SEO and gen AI search?"
Martin Splitt from Google replied with this list of attributes:
- support http/2
- declare identity in the user agent
- respect robots.txt
- backoff if the server slows
- follow caching directives*
- reasonable retry mechanisms
- follow redirects
- handle errors gracefully*
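
Taken together, those attributes describe a "polite" crawler. Here is a minimal sketch in Python of what that could look like in practice, assuming the third-party httpx library (installed with its h2 extra for HTTP/2); the bot name, URLs and function names are made up for illustration, not something Martin specified.

```python
import time
import urllib.robotparser

import httpx  # third-party; install with the h2 extra for HTTP/2 support

# Declare identity in the user agent, with a URL that explains the bot.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot.html)"

def fetch_politely(url, etag=None, max_retries=3):
    # Respect robots.txt before requesting the page itself.
    target = httpx.URL(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{target.scheme}://{target.host}/robots.txt")
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return None  # blocked by the Robots Exclusion Protocol

    headers = {"User-Agent": USER_AGENT}
    if etag:
        headers["If-None-Match"] = etag  # follow caching directives

    # http2=True negotiates HTTP/2; follow_redirects handles 301/302/etc.
    with httpx.Client(http2=True, follow_redirects=True, timeout=10.0) as client:
        delay = 1.0
        for _ in range(max_retries):
            try:
                response = client.get(url, headers=headers)
            except httpx.TransportError:
                time.sleep(delay)  # reasonable retry with exponential backoff
                delay *= 2
                continue

            if response.status_code in (429, 503):
                # Back off if the server slows down or rate limits us.
                retry_after = response.headers.get("Retry-After")
                time.sleep(float(retry_after) if retry_after and retry_after.isdigit() else delay)
                delay *= 2
                continue

            if response.status_code == 304:
                return None  # cached copy is still fresh; reuse the stored body

            if response.is_success:
                return response.text

            return None  # handle other errors gracefully instead of hammering

    return None
```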
Gary Illyes from Google pointed the conversation to a new IETF document on crawler best practices, which he said was posted a few weeks ago.
It covers the recommended best practices including:
- Crawlers must support and respect the Robots Exclusion Protocol.
- Crawlers must be easily identifiable through their user agent string.
- Crawlers must not interfere with the regular operation of a site.
- Crawlers must support caching directives.
- Crawlers must expose the IP ranges they are crawling from in a standardized format.
- Crawlers must expose a page that explains how the crawled data is used and how it can be blocked.
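
The IP-range requirement mirrors what some large crawlers already do by publishing their address blocks as a JSON file. As a rough sketch of why that matters for site owners, here is how a server could check that a visitor claiming to be a given crawler is actually coming from its published ranges; the URL and the JSON field names (prefixes, ipv4Prefix, ipv6Prefix) are assumptions for illustration, not a format the IETF document is quoted as specifying here.

```python
import ipaddress
import json
from urllib.request import urlopen

# Hypothetical location of a crawler's published IP ranges.
RANGES_URL = "https://crawler.example.com/ip-ranges.json"

def is_official_crawler(client_ip):
    """Return True if client_ip falls inside one of the published ranges."""
    with urlopen(RANGES_URL) as response:
        data = json.load(response)

    ip = ipaddress.ip_address(client_ip)
    for entry in data.get("prefixes", []):
        # Accept either IPv4 or IPv6 prefixes in the published file.
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix and ip in ipaddress.ip_network(prefix):
            return True
    return False
```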
Check out the full document over here; you can see that Gary Illyes co-authored it, though not under Google's name.
Forum discussion at Bluesky.
Image credit to Lizzi