Optimize your crawling and indexing
Monday, August 31, 2009
Posted by: Susan Moskwa, Webmaster Trends Analyst
Original: Optimize your crawling & indexing
Published: Sunday, August 9, 2009, 10:40 PM
Many questions about website architecture, crawling and indexing, and even ranking can be boiled down to one central question: how easy is it for search engines to crawl your site? We've spoken on this topic at a number of recent events, and below you'll find our presentation and some key takeaways.
The Internet is a big place; new content is being created all the time. Google's own resources are finite, so when faced with the nearly infinite quantity of content available online, Googlebot can only find and crawl a percentage of it. And of the content we've crawled, we can only index a portion.
URLs are like bridges between your website and a search engine's crawler: to reach your site's content, the crawler has to be able to find and cross those bridges (that is, find and crawl your URLs). If your URLs are complicated or redundant, crawlers will spend their time tracing and retracing their steps; if your URLs are organized and lead directly to distinct content, crawlers can spend their time understanding your content instead of crawling empty pages, or crawling the same duplicate content over and over via different URLs.
In the slides above you can see some examples of what not to do: real-life examples (though the names have been changed to protect the innocent) of homegrown URL hacks and encodings, parameters masquerading as part of the URL path, infinite crawl spaces, and more. You'll also find some recommendations for straightening out that labyrinth of URLs and helping crawlers find more of your content faster, including:
- URL parameters that don't change the content of the page, such as session IDs or sort order, can be removed from the URL and stored in a cookie. By putting this information in a cookie and 301-redirecting to a "clean" URL, you retain the information and reduce the number of URLs pointing to the same content (see the first sketch after this list).
- Do you have a calendar that links to an infinite number of past and future dates, each with its own unique URL? Does a page on your site still return a 200 status code when you add &page=3563 to the URL, even though there aren't nearly that many pages of data? If so, your site has an "infinite crawl space", which wastes both the crawler's bandwidth and your own. See these tips for reining in infinite spaces, and the second sketch after this list.
- Using your robots.txt file, you can block crawling of your login pages, contact forms, shopping carts, and other pages whose only functionality is something a crawler can't perform. (Crawlers are notoriously cheap and shy, so they don't usually "add items to the cart" or "contact us".) This lets crawlers spend more of their time on content they can actually do something with (see the robots.txt example after this list).
- In an ideal world there is a one-to-one pairing between URL and content: each URL leads to a unique piece of content, and each piece of content can only be reached via a single URL. The closer you get to that ideal, the easier your site is to crawl and index. If your content management system or current site setup makes this hard to achieve, you can use the rel=canonical element to indicate the preferred URL for a particular piece of content (see the last example after this list).
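To make the first recommendation concrete, here is a minimal sketch of moving content-irrelevant parameters into a cookie and 301-redirecting to a clean URL. It assumes a Flask app; the /products route and the parameter names are hypothetical examples, not something prescribed by the original post.

```python
# Minimal sketch of the "clean URL + cookie" idea, assuming a Flask app.
# The /products route and the parameter names are hypothetical examples.
from flask import Flask, redirect, request, url_for

app = Flask(__name__)

# Parameters that don't change what the page shows (session ID, sort order).
CONTENT_IRRELEVANT = {"sessionid", "sort"}

@app.route("/products")
def products():
    extra = {k: v for k, v in request.args.items() if k in CONTENT_IRRELEVANT}
    if extra:
        # Keep only the parameters that actually affect the content...
        clean = {k: v for k, v in request.args.items()
                 if k not in CONTENT_IRRELEVANT}
        # ...301-redirect to the clean URL, and stash the rest in cookies.
        resp = redirect(url_for("products", **clean), code=301)
        for k, v in extra.items():
            resp.set_cookie(k, v)
        return resp
    return "product listing"  # one clean URL now serves this content
```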
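For the infinite-space recommendation, the sketch below returns a 404 instead of a 200 for page numbers beyond what actually exists, so URLs like &page=3563 don't open up an endless crawl space. Again this assumes Flask; the /archive route and the total_pages() helper are hypothetical.

```python
# Minimal sketch of closing off an infinite crawl space, assuming Flask.
# The /archive route and total_pages() helper are hypothetical.
from flask import Flask, abort, request

app = Flask(__name__)

def total_pages():
    # Hypothetical helper: however many pages of data really exist.
    return 42

@app.route("/archive")
def archive():
    page = request.args.get("page", default=1, type=int)
    # Return 404 rather than 200 for pages that don't exist, so a crawler
    # requesting &page=3563 gets a clear "nothing here" instead of an
    # endless series of crawlable-but-empty URLs.
    if page < 1 or page > total_pages():
        abort(404)
    return f"archive page {page}"
```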
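For the robots.txt recommendation, the rules come down to a handful of Disallow lines. The paths below are placeholders; substitute whatever your own login, contact, and shopping-cart pages are actually called.

```text
User-agent: *
Disallow: /login
Disallow: /contact
Disallow: /cart
```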
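And for the last recommendation, rel=canonical is a single link element placed in the <head> of every duplicate or parameterized variant of a page, pointing at the URL you prefer; the example.com URL here is only a placeholder.

```html
<link rel="canonical" href="https://www.example.com/products/widgets">
```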