This guide describes how to optimize Google's crawling of very large and frequently updated sites.
If your site does not have a large number of pages that change rapidly, or if your pages seem to be crawled the same day that they are published, you don't need to read this guide; merely keeping your sitemap up to date and checking your index coverage regularly is adequate.
If you have content that's been available for a while but has never been indexed, this is a different problem; use the URL Inspection tool instead to find out why your page isn't being indexed.
Who this guide is for
This is an advanced guide and is intended for:
- Large sites (1 million+ unique pages) with content that changes moderately often (once a week)
- Medium or larger sites (10,000+ unique pages) with very rapidly changing content (daily)
General theory of crawling
The web is a nearly infinite space, exceeding Google's ability to explore and index every available URL. As a result, there are limits to how much time Googlebot can spend crawling any single site. The amount of time and resources that Google devotes to crawling a site is commonly called the site's crawl budget. Note that not everything crawled on your site will necessarily be indexed; each page must be evaluated, consolidated, and assessed to determine whether it will be indexed after it has been crawled.
Crawl budget is determined by two main elements: crawl capacity limit and crawl demand.
Crawl capacity limit
Googlebot wants to crawl your site without overwhelming your servers. To prevent this, Googlebot calculates a crawl capacity limit, which is the maximum number of simultaneous parallel connections that Googlebot can use to crawl a site, as well as the time delay between fetches. This is calculated to provide coverage of all your important content without overloading your servers.
The crawl capacity limit can go up and down based on a few factors:
- Crawl health: If the site responds quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
- Limit set by site owner in Search Console: Website owners can optionally reduce Googlebot's crawling of their site. Note that setting higher limits won't automatically increase crawling.
- Google's crawling limits: Google has a lot of machines, but not infinite machines. We still need to make choices with the resources that we have.
Crawl demand
Google typically spends as much time as necessary crawling a site, given its size, update frequency, page quality, and relevance, compared to other sites.
The factors that play a significant role in determining crawl demand are:
- Perceived inventory: Without guidance from you, Googlebot will try to crawl all or most of the URLs that it knows about on your site. If many of these URLs are duplicates, or you don't want them crawled for some other reason (removed, unimportant, and so on), this wastes a lot of Google crawling time on your site. This is the factor that you can positively control the most.
- Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.
- Staleness: Our systems want to recrawl documents frequently enough to pick up any changes.
Additionally, site-wide events like site moves may trigger an increase in crawl demand in order to reindex the content under the new URLs.
Taking crawl capacity and crawl demand together, Google defines a site's crawl budget as the set of URLs that Googlebot can and wants to crawl. Even if the crawl capacity limit isn't reached, if crawl demand is low, Googlebot will crawl your site less.
Follow these best practices to maximize your crawling efficiency:
- Manage your URL inventory: Use the appropriate tools to tell Google which pages to crawl and which not to crawl. If Google spends too much time crawling URLs that aren't appropriate for the index, Googlebot might decide that it's not worth the time to look at the rest of your site (or increase your budget to do so).
- Consolidate duplicate content. Eliminate duplicate content to focus crawling on unique content rather than unique URLs.
- Block crawling of URLs that you don't want indexed. Some pages might be important to users, but you don't want them to appear in Search results. For example, infinite scrolling pages that duplicate information on linked pages, or differently sorted versions of the same page. If you can't consolidate them as described in the first bullet, block these unimportant (for search) pages using robots.txt or the URL Parameters tool (for duplicate content reached by URL parameters).
- Return 404/410 for permanently removed pages. Google won't forget a URL that it knows about, but a 404 is a strong signal not to crawl that URL again. Blocked URLs, however, will stay part of your crawl queue much longer, and will be recrawled when the block is removed.
- Eliminate soft 404s. Soft 404s will continue to be crawled, and waste your budget. Check the Index Coverage report for soft 404 errors.
- Keep your sitemaps up to date. Google reads your sitemap regularly, so be sure to include all the content that you want Google to crawl. If your site includes updated content, we recommend including the
- Avoid long redirect chains, which have a negative effect on crawling.
- Make your pages efficient to load. If Google can load and render your pages faster, we might be able to read more content from your site.
- Monitor your site crawling. Monitor whether your site had any availability issues during crawling, and look for ways to make your crawling more efficient.
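As a rough illustration of the duplicate-consolidation advice above, the following Python sketch groups URLs that differ only in parameters that sort, filter, or track a session. The parameter names in `IGNORED_PARAMS` and the example URLs are assumptions for illustration only; which parameters are safe to ignore depends entirely on your site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that only sort a page or track a session.
# Replace these with the parameters that are truly content-neutral on your site.
IGNORED_PARAMS = {"sort", "order", "sessionid"}


def normalize(url: str) -> str:
    """Drop ignored parameters and the fragment, and sort the remaining parameters."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def duplicate_groups(urls):
    """Return {normalized_url: [variants]} for every group with more than one variant."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    # Made-up URLs, for example taken from a crawl export or server logs.
    sample = [
        "https://example.com/shoes?sort=price",
        "https://example.com/shoes?sort=name&sessionid=abc",
        "https://example.com/shoes?color=red",
        "https://example.com/hats",
    ]
    for canonical, variants in duplicate_groups(sample).items():
        print(canonical, "<-", variants)
```

Groups reported by a script like this are candidates for consolidation or blocking, not proof of duplication; confirm that the variants really serve the same content before acting on them.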
Monitor your site's crawling and indexing
Here are the key steps to monitoring your site's crawl profile:
- See if Googlebot is encountering availability issues on your site.
- See whether you have pages that aren't being crawled, but should be.
- See whether any parts of your site need to be crawled more quickly than they already are.
- Improve your site's crawl efficiency.
- Handle overcrawling of your site.
1. See if Googlebot is encountering availability issues on your site
Improving your site availability won't necessarily increase your crawl budget; Google determines the best crawl rate based on the crawl demand, as described previously. However, availability issues do prevent Google from crawling your site as much as it might want to.
Use the Crawl Stats report to see Googlebot's crawling history for your site. The report shows when Google encountered availability issues on your site. If availability errors or warnings are reported for your site, look for instances in the Host availability graphs where Googlebot requests exceeded the red limit line, click into the graph to see which URLs were failing, and try to correlate those with issues on your site.
- Read the documentation for the Crawl Stats report to learn how to find and handle some availability issues.
- Block pages from crawling if you don't want them to be crawled. (See manage your inventory)
- Increase page loading and rendering speed. (See Improve your site's crawl efficiency)
- Increase your server capacity. If Google consistently seems to be crawling your site at its serving capacity limit, but you still have important URLs that aren't being crawled or updated as much as they need, having more serving resources might enable Google to request more pages on your site. Check your host availability history in the Crawl Stats report to see if Google's crawl rate seems to be crossing the limit line often. If so, increase your serving resources for a month and see whether crawling requests increased during that same period.
2. See if any parts of your site are not crawled, but should be
Google spends as much time as necessary on your site in order to index all the high-quality, user-valuable content that it can find. If you think that Googlebot is missing important content, either it doesn't know about the content, the content is blocked from Google, or your site availability is throttling Google's access (or Google is trying not to overload your site).
Search Console doesn't provide a crawl history for your site that can be filtered by URL or path, but you can inspect your site logs to see whether specific URLs have been crawled by Googlebot. Whether or not those crawled URLs have been indexed is another story.
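One possible way to inspect your logs is sketched below in Python. It assumes your server writes the common "combined" access log format, and it identifies Googlebot by user-agent string only, which can be spoofed; the sample log lines are made up. Treat it as a starting point rather than a definitive parser.

```python
import re
from datetime import datetime

# Assumes the "combined" access log format; adjust the pattern for your server.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def last_googlebot_crawl(lines):
    """Return {path: (most_recent_crawl_time, status)} for requests whose
    user agent mentions Googlebot. The user agent alone is not proof that
    the request really came from Google."""
    latest = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        path = match.group("path")
        if path not in latest or when > latest[path][0]:
            latest[path] = (when, int(match.group("status")))
    return latest


# Made-up sample log lines for illustration.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /products/1 HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [11/Oct/2023:08:00:00 +0000] "GET /products/1 HTTP/1.1" 304 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [11/Oct/2023:09:00:00 +0000] "GET /products/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for path, (when, status) in last_googlebot_crawl(SAMPLE_LOG).items():
        print(path, when.isoformat(), status)
```

Paths that you expect to be crawled but that never appear in the result are the ones to investigate further.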
Remember that for most sites, new pages will take several days minimum to be noticed; most sites shouldn't expect same-day crawling for URLs, with the exception of time-sensitive sites such as news sites.
If you are adding pages to your site and they are not being crawled in a reasonable amount of time, either Google doesn't know about them, the content is blocked, your site has reached its maximum serving capacity, or you are out of crawl budget.
- Tell Google about your new pages: update your sitemaps to reflect new URLs.
- Examine your robots.txt rules to confirm that you're not accidentally blocking pages.
- If all your non-crawled pages have URL parameters, it's possible that your pages were excluded because of settings in the URL Parameters tool; unfortunately there isn't a way to check for such an exclusion, which is why we typically recommend against using that tool.
- Review your crawling priorities (a.k.a. use your crawl budget wisely). Manage your inventory and improve your site's crawling efficiency.
- Check that you're not running out of serving capacity. Googlebot will scale back its crawling if it detects that your servers are having trouble responding to crawl requests.
Note that pages might not be shown in search results, even if crawled, if there isn't sufficient value or user demand for the content.
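To check for accidental robots.txt blocks, a short script along the following lines may help. It is a sketch that uses Python's standard `urllib.robotparser`, which follows the original robots exclusion conventions and does not replicate every detail of how Google interprets rules (for example, wildcard patterns), so treat the output as a first pass. The rules and URLs shown are made up.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls, user_agent: str = "Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent,
    according to Python's standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Made-up robots.txt content and URLs for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

SAMPLE_URLS = [
    "https://example.com/private/report.html",
    "https://example.com/products/1",
    "https://example.com/search?q=shoes",
]

if __name__ == "__main__":
    for url in blocked_urls(SAMPLE_ROBOTS, SAMPLE_URLS):
        print("blocked:", url)
```

If an important URL shows up as blocked here, review the matching rule by hand before changing anything.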
3. See if updates are crawled quickly enough
If we're missing new or updated pages on your site, perhaps it's because we haven't seen them, or haven't noticed that they are updated. Here is how you can help us be aware of page updates.
Note that Google strives to check and index pages in a reasonably timely manner. For most sites, this is three days or more. Don't expect Google to index pages the same day that you publish them unless you are a news site or have other high-value, extremely time-sensitive content.
Examine your site logs to see when specific URLs were crawled by Googlebot.
To learn the indexing date, use the URL Inspection tool or do a Google search for URLs that you updated.
- Use a news sitemap if your site has news content. Ping Google when your sitemap is posted or has changed.
- Use the <lastmod> tag in sitemaps to indicate when an indexed URL has been updated.
- Use a simple URL structure to help Google find your pages.
- Provide standard, crawlable <a> links to help Google find your pages.
Avoid the following:
- Submitting the same, unchanged sitemap multiple times per day.
- Expecting that Googlebot will crawl everything in a sitemap, or crawl it immediately. Sitemaps are useful suggestions to Googlebot, not absolute requirements.
- Including URLs in your sitemaps that you don't want to appear in Search. This can waste your crawl budget on pages that you don't want indexed.
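The <lastmod> recommendation above could be implemented along these lines. This Python sketch builds a minimal sitemap from (URL, last-modified date) pairs; the function name, the URLs, and the dates are illustrative assumptions, and a real sitemap should take its dates from your content system so that <lastmod> reflects genuine changes.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages) -> bytes:
    """Build sitemap XML from an iterable of (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # A plain YYYY-MM-DD date is one of the formats accepted for <lastmod>.
        ET.SubElement(entry, "lastmod").text = modified.isoformat()
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Made-up pages; only list URLs that you want to appear in Search.
    pages = [
        ("https://example.com/", date(2024, 1, 15)),
        ("https://example.com/products/1", date(2024, 2, 3)),
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

Regenerate the file only when the page list or a modification date actually changes, in line with the advice above about not resubmitting unchanged sitemaps.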
4. Improve your site's crawl efficiency
Increase your page loading speed
Google's crawling is limited by bandwidth, time, and availability of Googlebot instances. If your server responds to requests more quickly, we might be able to crawl more pages on your site. That said, Google only wants to crawl high-quality content, so simply making low-quality pages faster won't encourage Googlebot to crawl more of your site; conversely, if we think that we're missing high-quality content on your site, we'll probably increase your budget to crawl that content.
Here's how you can optimize your pages and resources for crawling:
- Prevent large but unimportant resources from being loaded by Googlebot using robots.txt. Be sure to block only non-critical resources--that is, resources that aren't important to understanding the meaning of the page (such as decorative images).
- Make sure that your pages are fast to load.
- Watch out for long redirect chains, which have a negative effect on crawling.
- Both the time to respond to server requests and the time needed to render pages matter, including load and run time for embedded resources such as images and scripts. Be aware of large or slow resources required for indexing.
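To find long redirect chains before a crawler does, you could audit your redirect rules offline. The Python sketch below assumes you can export your redirects as a simple source-to-target mapping (the `redirects` dictionary and its paths are hypothetical), and reports every starting URL that needs more than a chosen number of hops.

```python
def redirect_chain(start, redirects, max_hops=10):
    """Follow redirects (a {source: target} mapping) from start.
    Returns the list of URLs visited; stops on a loop or after max_hops hops."""
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in seen:
            break  # redirect loop detected
        seen.add(target)
    return chain


def long_chains(redirects, max_redirects=1):
    """Return {source: chain} for sources that need more than max_redirects hops."""
    result = {}
    for source in redirects:
        chain = redirect_chain(source, redirects)
        if len(chain) - 1 > max_redirects:
            result[source] = chain
    return result


if __name__ == "__main__":
    # Hypothetical redirect rules exported from a server configuration.
    redirects = {"/a": "/b", "/b": "/c", "/c": "/d", "/old": "/new"}
    for source, chain in long_chains(redirects).items():
        print(source, "->", " -> ".join(chain[1:]))
```

Each chain reported this way can usually be shortened by pointing the source directly at the final target.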
Hide URLs that you don't want in search results
Wasting server resources on unnecessary pages can reduce crawl activity from pages that are important to you, which may cause a significant delay in discovering great new or updated content on a site.
Exposing many URLs on your site that you don't want crawled by Search can negatively affect a site's crawling and indexing. Typically these URLs fall into the following categories:
- Faceted navigation and session identifiers: Faceted navigation is typically duplicate content from the site; session identifiers and other URL parameters that simply sort or filter the page don't provide new content. Use robots.txt to block faceted navigation pages. If you find that Google is crawling a significant number of essentially duplicate URLs with different parameters on your site, consider blocking parameterized duplicate content.
- Duplicate content: Help Google identify duplicate content to avoid unnecessary crawling.
- Soft 404 pages: Return a 404 code when a page no longer exists.
- Hacked pages: Be sure to check the Security Issues report and fix or remove any hacked pages you find.
- Infinite spaces and proxies: Block these from crawling with robots.txt.
- Low quality and spam content: Good to avoid, obviously.
- Shopping cart pages, infinite scrolling pages, and pages that perform an action (such as "sign up" or "buy now" pages).
- Use robots.txt if you don't want Google to crawl a resource or page at all.
- Don't add or remove pages or directories from robots.txt regularly as a way of "freeing up" some additional crawl budget for your site. Use robots.txt only for pages or resources that you don't want to appear on Google for the long run.
- Don't rotate sitemaps or use other temporary hiding mechanisms to "free up more budget."
5. Handle overcrawling of your site (emergencies)
Googlebot has algorithms to prevent it from overwhelming your site with crawl requests. However, if you find that Googlebot is overwhelming your site, there are a few things you can do.
Monitor your server for excessive Googlebot requests to your site.
In an emergency, we recommend the following steps to slow down an overwhelming crawl from Googlebot:
- Return 503/429 HTTP result codes temporarily for Googlebot requests when your server is overloaded. Googlebot will retry these URLs for about 2 days. Note that returning "no availability" codes for more than a few days will cause Google to permanently slow or stop crawling URLs on your site, so follow the remaining steps as well.
- Reduce the Googlebot crawl rate for your site. This can take up to 2 days to take effect, and requires Search Console property owner permissions. Do this only if you see long-term, repeated overcrawling from Google in the Crawl Stats report, in the Host availability > Host utilization chart.
- When the crawl rate goes down, stop returning 503/429 for crawl requests; returning 503 for more than 2 days will cause Google to drop the 503 URLs from the index.
- Monitor your crawling and your host capacity over time, and if appropriate, increase your crawl rate again, or allow the default crawling rate.
- If the problematic crawler is one of the AdsBot crawlers, the problem is likely that you have created Dynamic Search Ad targets for your site that Google is trying to crawl. This crawl will reoccur every 2 weeks. If you don't have the server capacity to handle these crawls, either limit your ad targets or get increased serving capacity.
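The first emergency step, temporarily returning 503 to Googlebot while the server is overloaded, might look like the following sketch for a Python WSGI application. The `throttle_googlebot` wrapper and the `is_overloaded` callback are hypothetical names introduced here; how you measure overload (load average, queue depth, error rate) is up to you, and matching on the user-agent string is an assumption that may not fit every setup.

```python
def throttle_googlebot(app, is_overloaded):
    """Wrap a WSGI app so that requests whose user agent mentions Googlebot
    receive a 503 while is_overloaded() returns True."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if "Googlebot" in agent and is_overloaded():
            body = b"Service temporarily unavailable\n"
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # All other traffic, and Googlebot when load is normal, passes through.
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only for this illustration."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"ok\n"]


# Example wiring: the lambda is a placeholder for a real overload check.
application = throttle_googlebot(demo_app, is_overloaded=lambda: False)
```

Because the check runs on every request, the 503 responses stop as soon as `is_overloaded()` returns False, which fits the guidance above to return these codes only temporarily.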
Myths and facts about crawling
Test your knowledge on how Google crawls and indexes websites.
nofollow directive affects crawl budget.
nofollow, it can still be crawled if another page on your site, or any page on the web, doesn't label the link as