Google's crawlers and fetchers run simultaneously on thousands of machines to improve performance and scale as the web grows. To optimize bandwidth usage, these clients are distributed across many datacenters around the world so that they're located near the sites they might access. As a result, your logs may show visits from several IP addresses. Google egresses primarily from IP addresses in the United States. If Google detects that a site is blocking requests from the United States, it may attempt to crawl from IP addresses located in other countries.
Supported transfer protocols
Google's crawlers support HTTP/1.1 and HTTP/2. The crawlers use the protocol version that provides the best crawling performance and may switch protocols between crawling sessions depending on previous crawling statistics. The default protocol version used by Google's crawlers is HTTP/1.1; crawling over HTTP/2 may save computing resources (for example, CPU and RAM) for your site and Googlebot, but it provides no Google-product-specific benefit to the site (for example, no ranking boost in Google Search).
To opt out of crawling over HTTP/2, instruct the server that hosts your site to respond with a 421 HTTP status code when Google attempts to access your site over HTTP/2. If that's not feasible, you can send a message to the Crawling team (however, this is only a temporary solution).
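For illustration, here is a minimal sketch of this opt-out at the application layer in Go. The "Googlebot" user-agent check, the handler name, and the certificate paths are assumptions for the example only; many sites would implement the same rule in their web server or CDN configuration instead.

```go
package main

import (
	"net/http"
	"strings"
)

// optOutHTTP2 wraps a handler and answers HTTP/2 requests from Googlebot
// with 421 (Misdirected Request), so crawling falls back to HTTP/1.1.
// The user-agent substring check is a simplification for illustration.
func optOutHTTP2(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && strings.Contains(r.UserAgent(), "Googlebot") {
			w.WriteHeader(http.StatusMisdirectedRequest) // 421
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/", optOutHTTP2(http.FileServer(http.Dir("./public"))))
	// HTTP/2 in net/http is negotiated over TLS; the certificate paths are placeholders.
	http.ListenAndServeTLS(":443", "cert.pem", "key.pem", mux)
}
```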
Google's crawler infrastructure also supports crawling through FTP (as defined by RFC 959 and its updates) and FTPS (as defined by RFC 4217 and its updates), but crawling through these protocols is rare.
Supported content encodings
Google's crawlers and fetchers support the following content encodings (compressions): gzip, deflate, and Brotli (br). The content encodings supported by each Google user agent are advertised in the Accept-Encoding header of each request they make. For example: Accept-Encoding: gzip, deflate, br.
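As a sketch of what honoring this header can look like on the server side, the following Go handler compresses a response with gzip only when the request's Accept-Encoding advertises it; the route and response body are placeholders.

```go
package main

import (
	"compress/gzip"
	"io"
	"net/http"
	"strings"
)

// gzipIfAccepted compresses the response only when the client advertises
// gzip support in its Accept-Encoding request header.
func gzipIfAccepted(w http.ResponseWriter, r *http.Request, body string) {
	w.Header().Add("Vary", "Accept-Encoding") // caches must key on the encoding
	if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
		io.WriteString(w, body) // client did not ask for compression
		return
	}
	w.Header().Set("Content-Encoding", "gzip")
	gz := gzip.NewWriter(w)
	defer gz.Close()
	io.WriteString(gz, body)
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		gzipIfAccepted(w, r, "<html><body>Hello, crawler!</body></html>")
	})
	http.ListenAndServe(":8080", nil)
}
```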
Crawl rate and host load
Our goal is to crawl as many pages of your site as we can on each visit without overwhelming your server. If your site is having trouble keeping up with Google's crawl requests, you can reduce the crawl rate.
Note that sending inappropriate HTTP response codes to Google's crawlers may affect how your site appears in Google products.
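One way a server can signal transient overload is to answer with a 503 status while the pressure lasts. The sketch below is only an assumption-laden example (the in-flight threshold and Retry-After value are arbitrary); prolonged 5xx responses can themselves affect how pages are crawled and indexed, so see the crawl-rate documentation referenced above for Google's current guidance.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// inFlight tracks concurrent requests; the threshold is an arbitrary example.
var inFlight int64

const maxInFlight = 100

// shedLoad answers with 503 (Service Unavailable) when too many requests are
// in flight, signaling crawlers to back off temporarily. Serve 503 only for
// short periods of genuine overload.
func shedLoad(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&inFlight, 1) > maxInFlight {
			atomic.AddInt64(&inFlight, -1)
			w.Header().Set("Retry-After", "120") // hint, in seconds
			http.Error(w, "try again later", http.StatusServiceUnavailable)
			return
		}
		defer atomic.AddInt64(&inFlight, -1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.Handle("/", shedLoad(http.FileServer(http.Dir("./public"))))
	http.ListenAndServe(":8080", nil)
}
```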
HTTP caching

Google's crawling infrastructure supports heuristic HTTP caching as defined by the HTTP caching standard (RFC 9111), specifically through the ETag response header and If-None-Match request header, and through the Last-Modified response header and If-Modified-Since request header.

Note: Consider setting both the ETag and Last-Modified values regardless of the preference of Google's crawlers; these headers are also used by other applications such as CMSes.

If both the ETag and Last-Modified response header fields are present in the HTTP response, Google's crawlers use the ETag value, as required by the HTTP standard. For Google's crawlers specifically, we recommend using ETag instead of the Last-Modified header to indicate caching preference, since ETag doesn't have date formatting issues.

Other HTTP caching directives aren't supported.

Individual Google crawlers and fetchers may or may not make use of caching, depending on the needs of the product they're associated with. For example, Googlebot supports caching when re-crawling URLs for Google Search, and Storebot-Google only supports caching in certain conditions.

To implement HTTP caching for your site, get in touch with your hosting or content management system provider.

ETag and If-None-Match

Google's crawling infrastructure supports ETag and If-None-Match as defined by the HTTP caching standard. Learn more about the ETag response header and its request header counterpart, If-None-Match, in RFC 9110.

Last-Modified and If-Modified-Since

Google's crawling infrastructure supports Last-Modified and If-Modified-Since as defined by the HTTP caching standard, with the following caveats:

- The date in the Last-Modified header must be formatted according to the HTTP standard. To avoid parsing issues, we recommend the date format "Weekday, DD Mon YYYY HH:MM:SS Timezone", for example "Fri, 4 Sep 1998 19:15:56 GMT".
- While not required, consider also setting the max-age field of the Cache-Control response header to help crawlers determine when to recrawl the specific URL. Set the value of the max-age field to the expected number of seconds the content will remain unchanged, for example Cache-Control: max-age=94043.

Learn more about the Last-Modified response header and its request header counterpart, If-Modified-Since, in RFC 9110.
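To make the conditional-request flow concrete, here is a minimal Go sketch that emits ETag, Last-Modified, and Cache-Control: max-age, and answers with 304 Not Modified when the client's cached copy is still current. The ETag value, modification time, and max-age are made up for the example.

```go
package main

import (
	"net/http"
	"time"
)

var (
	pageBody    = []byte("<html><body>Cacheable page</body></html>")
	pageModTime = time.Date(2025, time.January, 15, 10, 0, 0, 0, time.UTC)
	pageETag    = `"v42"` // ETag values are quoted strings
)

// servePage answers conditional requests (If-None-Match / If-Modified-Since)
// with 304 Not Modified when the cached copy is still current.
func servePage(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("ETag", pageETag)
	w.Header().Set("Last-Modified", pageModTime.UTC().Format(http.TimeFormat))
	w.Header().Set("Cache-Control", "max-age=86400") // optional recrawl hint

	// ETag takes precedence over Last-Modified, per the HTTP standard.
	if r.Header.Get("If-None-Match") == pageETag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	if ims := r.Header.Get("If-Modified-Since"); ims != "" {
		if t, err := http.ParseTime(ims); err == nil && !pageModTime.Truncate(time.Second).After(t) {
			w.WriteHeader(http.StatusNotModified)
			return
		}
	}
	w.Write(pageBody)
}

func main() {
	http.HandleFunc("/", servePage)
	http.ListenAndServe(":8080", nil)
}
```

In Go specifically, http.ServeContent applies the same conditional logic automatically when given a modification time, so hand-rolled checks like these are mostly useful for illustrating the protocol.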
Verifying Google's crawlers and fetchers

Google's crawlers identify themselves in three ways:

1. The HTTP user-agent request header.
2. The source IP address of the request.
3. The reverse DNS hostname of the source IP.

Learn how to use these details to verify Google's crawlers and fetchers.
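As an illustration of the reverse-then-forward DNS check, the sketch below assumes the googlebot.com and google.com hostname suffixes; the exact suffix list depends on the crawler category, so consult the verification documentation for the authoritative domains and IP ranges.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// looksLikeGooglebot performs a reverse-then-forward DNS check: the client IP
// must reverse-resolve to a googlebot.com or google.com host, and that
// hostname must resolve back to the same IP. The suffixes are illustrative.
func looksLikeGooglebot(ip string) (bool, error) {
	hosts, err := net.LookupAddr(ip) // reverse DNS (PTR) lookup
	if err != nil {
		return false, err
	}
	for _, h := range hosts {
		host := strings.TrimSuffix(h, ".")
		if !strings.HasSuffix(host, ".googlebot.com") && !strings.HasSuffix(host, ".google.com") {
			continue
		}
		addrs, err := net.LookupHost(host) // forward-confirm the hostname
		if err != nil {
			continue
		}
		for _, a := range addrs {
			if a == ip {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	// Example IP taken from a typical Googlebot log entry; replace with the
	// source IP observed in your own access logs.
	ok, err := looksLikeGooglebot("66.249.66.1")
	fmt.Println(ok, err)
}
```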