How to verify Googlebot
Sunday, February 24, 2008
Posted by Matt Cutts, Software Engineer
Original: How to verify Googlebot
Originally posted: Wednesday, September 20, 2006, 11:45 AM
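Lately I've heard a couple of smart people ask that search engines provide a way to verify that a bot is authentic. After all, any spammer could name their bot "Googlebot" and claim to be Google. So which bots should you trust, and which should you block?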
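The most common request we hear is to publish a list of Googlebot IP addresses. The problem with that is that if/when the IP ranges of our crawlers change, not everyone will know to check. In fact, the crawl team migrated Googlebot IPs a couple of years ago, and it was a real hassle alerting webmasters who had hard-coded an IP range. So the crawl folks have provided another way to authenticate Googlebot. Here's an answer from one of the crawl people (quoted with their permission):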
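Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name. For example: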
(Translator's note: the following are Linux commands and their output.)
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
(That is, the reverse lookup for 66.249.66.1 returns the name crawl-66-249-66-1.googlebot.com.)
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
(That is, crawl-66-249-66-1.googlebot.com resolves back to the IP address 66.249.66.1.)
I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.
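The two-step check quoted above could be automated along the following lines. This is only a minimal sketch and not part of the original post: the function names `reverse_lookup`, `forward_lookup`, and `is_googlebot` are made up for illustration, and it assumes IPv4 addresses and Python's standard `socket` module. The forward lookup is the step that defeats a spoofed PTR record, since a spoofer controls reverse DNS for their own IP address but not the forward records under googlebot.com.

```python
import socket
from typing import Callable, List


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address (what `host <ip>` reports)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses that a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Hypothetical helper: reverse lookup, domain check, then forward lookup.

    The resolver functions are parameters so the logic can be exercised
    without live DNS queries.
    """
    try:
        # Step 1: reverse DNS lookup (IP -> hostname).
        hostname = reverse(ip).rstrip(".").lower()
        # Step 2: the name must be inside the googlebot.com domain.
        # The leading dot rejects look-alikes such as "evilgooglebot.com".
        if not hostname.endswith(".googlebot.com"):
            return False
        # Step 3: forward DNS lookup (hostname -> IPs) must include the
        # original IP; a spoofed PTR record fails here.
        return ip in forward(hostname)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses;
        # treat any lookup failure as "not verified".
        return False
```

Calling `is_googlebot("66.249.66.1")` with the default resolvers performs live DNS queries, so the result depends on the current DNS records. Because each call costs two lookups, a real deployment would probably cache results per IP address for a while rather than check on every request.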
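This answer has also been provided to our help desk, so I'd consider it an official way to authenticate Googlebot. In order to fetch from the "official" Googlebot IP range, the bot has to respect robots.txt and our internal hostload conventions so that Google doesn't crawl your site too hard.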
(Thanks to N. and J. for help on this answer from the crawl side of things.)
Last updated (UTC): 2008-02-01.