First date with the Googlebot: Headers and compression
Wednesday, March 26, 2008
Written by: Maile Ohye (as the website), Jeremy Lilley (as the Googlebot)
Original post: First date with the Googlebot: Headers and compression
Posted: Wednesday, March 5, 2008, 6:13 PM

Googlebot, what a dreamboat! He knows our soul and every one of our parts. He's probably not looking for anything exclusive; he has browsed billions of other sites (though we share our data with other search engine bots as well :)), but tonight, as website and crawler, we will really get to know each other.
I know it's never a good idea to over-analyze a first date. We're going to get to know Googlebot a little at a time, over a series of posts:
1. Our first date (tonight!): the headers Googlebot sends, the file formats he takes notice of, and whether it's better to compress data;
2. Judging his response: response codes (301, 302), and how he handles redirects and If-Modified-Since;
3. Next steps: following links, and getting him to crawl faster or slower (so he doesn't get carried away).

And tonight is just our first date...
***************
Googlebot: ACK
Website: Googlebot, you're here!
Googlebot: Yes, here I am!
GET / HTTP/1.1
Host: example.com
Connection: Keep-alive
Accept: */*
From: googlebot(at)googlebot.com
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)
Accept-Encoding: gzip,deflate
Website: Those headers are so flashy! Would you crawl with the same headers whether my site were in the U.S., Asia, or Europe? Do you ever use different headers?
Googlebot: Generally speaking, the headers I use are consistent around the world. I try to work out what a page looks like with a site's default language and settings. Sometimes the user agent differs; AdSense fetches, for example, use "Mediapartners-Google":
User-Agent: Mediapartners-Google
Or, for image search:
User-Agent: Googlebot-Image/1.0
The user agents for wireless fetches vary by carrier, while Google Reader RSS fetches include extra information such as the number of subscribers.
I usually avoid cookies (so no "Cookie:" header), because I don't want session-specific information to influence the content too much. Also, if a server uses a session ID in a dynamic URL rather than in a cookie, I can usually recognize that, so I don't end up crawling the same page thousands of times under different session IDs.
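As a rough illustration of that kind of de-duplication (not Googlebot's actual implementation), a crawler could strip session-style query parameters before deciding whether a URL has already been seen; the parameter names and helper below are assumed examples only:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical query parameters that look like session IDs.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url: str) -> str:
    """Drop session-ID-style parameters so duplicate URLs collapse to one key."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))

# Both variants map to the same canonical URL, so the page is fetched only once.
print(canonicalize("https://www.example.com/page?sid=abc123&lang=en"))
print(canonicalize("https://www.example.com/page?sid=zzz999&lang=en"))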
Website: I'm very complex. I use many types of files. Your header says "Accept: */*". Do you index all URLs, or are certain file extensions automatically filtered out?
Googlebot: That depends on what I'm looking for. If I'm indexing for regular web search and I see links to MP3s and videos, I probably won't download them. Similarly, if I see a JPG, I treat it differently than an HTML or PDF link. For instance, JPGs tend to change far less often than HTML, so I check JPGs for changes less frequently to save bandwidth. Meanwhile, if I'm looking for links for Google Scholar, I'm far more interested in a PDF article than in a JPG file. Downloading doodles (like JPGs) or videos of dogs on skateboards is distracting for a scholar, don't you think?
Website: True, they might find that distracting. Your dedication leaves me in awe. I love doodles (JPGs) myself and find them hard to resist.
Googlebot: Same here. Actually, I'm not always so scholarly. When I crawl for image search, I'm very interested in JPGs, and for news I spend most of my effort on HTML and the images near it.
There are also plenty of extensions, such as exe, dll, zip, and dmg, that tend to be large and not very useful for a search engine.
Website: If you saw my URL "https://www.example.com/page1.LOL111", would you (whimpering) turn it away just because it contains an unknown file extension?
Googlebot: Website, my friend, let me give you a bit of background. Once a file has actually been downloaded, I use the Content-Type header to check whether it is really HTML, an image, text, or something else. If it's a special data type like a PDF, a Word document, or an Excel spreadsheet, I make sure it's in a valid format and extract the text content (maybe it has a virus; you never know). But if the document or data type is really garbled, there usually isn't much I can do besides throw it away.
So if I crawl your URL "https://www.example.com/page1.LOL111" and find an unknown file extension, I'll probably start by downloading it. If I can't figure out the content type from the header, or it's a format we don't index (MP3, for example), it gets set aside. Otherwise, we go ahead and index the file.
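A rough sketch of that decision flow (not Googlebot's actual code): download the URL, read the Content-Type header, and only pass known types on for indexing. The whitelist and the helper name here are assumptions for illustration:

import urllib.request

# Hypothetical whitelist of content types this toy crawler knows how to index.
INDEXABLE_TYPES = {"text/html", "text/plain", "application/pdf"}

def fetch_and_classify(url: str):
    """Download a URL, then decide from its Content-Type whether to index it."""
    req = urllib.request.Request(url, headers={"User-Agent": "toy-crawler/0.1"})
    with urllib.request.urlopen(req) as resp:
        # get_content_type() strips any "; charset=..." suffix from the header.
        content_type = resp.headers.get_content_type()
        body = resp.read()
    if content_type in INDEXABLE_TYPES:
        return "index", content_type, body
    return "skip", content_type, None

decision, ctype, _ = fetch_and_classify("https://www.example.com/")
print(decision, ctype)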
Website: My apologies for nitpicking your working style, Googlebot, but I noticed that your Accept-Encoding header says:
Accept-Encoding: gzip,deflate
Can you tell me what these headers are about?
Googlebot: Of course. All major search engines and web browsers support gzip compression of content to save bandwidth. Other values you might run into here include "x-gzip" (the same as "gzip"), "deflate" (which we also support), and "identity" (no compression at all).
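On the crawler or browser side, honoring this negotiation means sending the Accept-Encoding request header and then undoing whatever encoding the server reports in Content-Encoding. A minimal sketch using only the Python standard library; the fallback to raw DEFLATE is an assumption about how some servers behave, as discussed further below:

import gzip
import zlib
import urllib.request

def fetch_compressed(url: str) -> bytes:
    """Ask the server for compressed content and decode whatever encoding it chose."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip,deflate"})
    with urllib.request.urlopen(req) as resp:
        encoding = resp.headers.get("Content-Encoding", "identity").lower()
        body = resp.read()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        # Some servers send a zlib-wrapped stream, others a raw DEFLATE stream;
        # try the zlib wrapper first, then fall back to raw DEFLATE.
        try:
            return zlib.decompress(body)
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body  # "identity": the server sent the content uncompressed

print(len(fetch_compressed("https://www.example.com/")), "bytes after decoding")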
Website: Can you say a bit more about file compression and "Accept-Encoding: gzip,deflate"? Many of my URLs are large Flash files and gorgeous images, not just HTML. Would it help you crawl faster if I compressed my larger files?
Googlebot: There's no simple answer to that question. First of all, file formats such as swf (Flash), jpg, png, gif, and pdf are already compressed in themselves (and there are also specialized Flash optimizers).
Website: So perhaps I've been compressing my Flash files all along without even knowing it. Clearly I'm very efficient.
Googlebot: Both Apache and IIS offer options for gzip and deflate compression; of course, the bandwidth saved comes at the cost of extra CPU. Typically this is only enabled for easily compressible files, such as text HTML/CSS/PHP content, and it's only used when the user's browser or I (a search engine crawler) allow it. Personally, I prefer "gzip" over "deflate". The gzip encoding is a bit more reliable, because it always carries a checksum and a complete header, so it leaves me with less guesswork than "deflate" does. Otherwise, the two use very similar compression algorithms.
If your servers have spare CPU capacity, it can be worth experimenting with compression (links: Apache, IIS). But if you serve dynamic content and your servers' CPUs are already running at full load, I'd suggest holding off.
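To make the gzip-versus-deflate point concrete, here is a small comparison using Python's standard library. It isn't tied to any particular server setup; it just shows that gzip wraps the same DEFLATE data in a full header plus a CRC-32 trailer, while a raw DEFLATE stream has no wrapper to validate against, which is the guesswork Googlebot mentions:

import gzip
import zlib

sample = b"<html><body>" + b"Hello, Googlebot! " * 200 + b"</body></html>"

# RFC 1952 gzip: 10-byte header + DEFLATE stream + CRC-32 and length trailer.
gz = gzip.compress(sample)
# RFC 1950 zlib wrapper (what HTTP calls "deflate"): 2-byte header + Adler-32 checksum.
zl = zlib.compress(sample)
# Raw DEFLATE with no wrapper at all, which some servers send instead: the guesswork case.
raw_compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
raw = raw_compressor.compress(sample) + raw_compressor.flush()

print("original    :", len(sample), "bytes")
print("gzip        :", len(gz), "bytes, starts with magic bytes", gz[:2].hex())
print("zlib/deflate:", len(zl), "bytes")
print("raw DEFLATE :", len(raw), "bytes, no header or checksum to validate")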
Website: That's very good to know. I'm so glad you could come see me tonight; thank goodness my robots.txt file allowed you in. That file can sometimes be like an over-protective parent!
Googlebot: Speaking of which, time to meet the parents: the robots.txt. I've met plenty of crazy ones. Some are really just HTML error pages rather than valid robots.txt files. Some are full of endless redirects, possibly to completely unrelated sites, while others are enormous and list thousands of different URLs one by one. Here is one pattern with unfortunate side effects. Under normal circumstances, this site wants me to crawl its content:
User-Agent: *
Allow: /
Then, during a peak period of user traffic, the site switches its robots.txt to something highly restrictive:
# Can you go away for a while? I'll let you back
# again in the future. Really, I promise!
User-Agent: *
Disallow: /
The problem with swapping robots.txt files like this is that once I see the restrictive version, I may have to start throwing away content from this site that I've already crawled and indexed. Then, when I'm allowed back in, I have to recrawl a lot of that content all over again. A 503 response code, at least, would only have been temporary.
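If the goal is only to shed load during a traffic spike, one alternative in that spirit is to answer requests with a temporary 503 plus a Retry-After header instead of rewriting robots.txt. A minimal sketch with Python's built-in http.server, purely for illustration; the overload check is a made-up placeholder:

from http.server import BaseHTTPRequestHandler, HTTPServer

def site_is_overloaded() -> bool:
    """Placeholder for a real load check (CPU, request queue depth, and so on)."""
    return True

class PeakTrafficHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if site_is_overloaded():
            # 503 tells crawlers "temporarily unavailable, try again later"
            # without changing what robots.txt allows.
            self.send_response(503)
            self.send_header("Retry-After", "3600")  # seconds
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>Welcome back, Googlebot.</body></html>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), PeakTrafficHandler).serve_forever()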
Generally speaking, I only re-check robots.txt about once a day (otherwise, on many virtually hosted sites, I'd spend a large share of my fetches just retrieving robots.txt, and not many dates enjoy meeting the parents that often). For webmasters, trying to control crawl rate by swapping robots.txt files tends to backfire; a better approach is to set the crawl rate to "slower" in Webmaster Tools.
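On the crawler side, that once-a-day behavior amounts to caching the parsed robots.txt with a time-to-live rather than re-fetching it before every request. A minimal sketch using Python's urllib.robotparser; the 24-hour TTL and the cache layout are assumptions for illustration:

import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 60 * 60  # re-check about once a day, as described above
_cache = {}  # host -> (fetched_at, parser)

def allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Check robots.txt for this URL, re-fetching it at most once per TTL."""
    host = urlsplit(url).netloc
    fetched_at, parser = _cache.get(host, (0.0, None))
    if parser is None or time.time() - fetched_at > ROBOTS_TTL:
        parser = RobotFileParser(f"https://{host}/robots.txt")
        parser.read()  # downloads and parses the file
        _cache[host] = (time.time(), parser)
    return parser.can_fetch(user_agent, url)

print(allowed("https://www.example.com/page1.html"))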
Googlebot: Website, thanks for all of your questions. You've been wonderful, but I'm afraid I now have to say "FIN, my love."
Website: Oh, Googlebot... ACK/FIN. :)
***************