Abstract
This document details how Google handles the robots.txt file that allows you to control how Google's website crawlers crawl and index publicly accessible websites.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Basic definitions
| Term | Definition |
|---|---|
| Crawler | A crawler is a service or agent that crawls websites. Generally speaking, a crawler automatically and recursively accesses known URLs of a host that exposes content which can be accessed with standard web browsers. As new URLs are found (through various means, such as from links on existing, crawled pages or from Sitemap files), these are also crawled in the same way. |
| User-agent | A means of identifying a specific crawler or set of crawlers. |
| Directives | The list of applicable guidelines for a crawler or group of crawlers set forth in the robots.txt file. |
| URL | Uniform Resource Locators as defined in RFC 1738. |
| Google-specific | These elements are specific to Google's implementation of robots.txt and may not be relevant for other parties. |
Applicability
The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, or malware analysis), these guidelines do not necessarily apply.
File location & range of validity
The robots.txt file must be in the top-level directory of the host, accessible through the appropriate protocol and port number. Generally accepted protocols for robots.txt (and crawling of websites) are "http" and "https". On http and https, the robots.txt file is fetched using a non-conditional HTTP GET request.
Google-specific: Google also accepts and follows robots.txt files for FTP sites. FTP-based robots.txt files are accessed via the FTP protocol, using an anonymous login.
The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.
Examples of valid robots.txt URLs
| Robots.txt URL | Validity |
|---|---|
| http://example.com/robots.txt | Valid for: http://example.com/, http://example.com/folder/file. Not valid for: http://other.example.com/, https://example.com/, http://example.com:8181/ |
| http://www.example.com/robots.txt | Valid for: http://www.example.com/. Not valid for: http://example.com/, http://shop.www.example.com/, http://www.shop.example.com/ |
| http://example.com/folder/robots.txt | Not a valid robots.txt file. Crawlers will not check for robots.txt files in subdirectories. |
| http://www.müller.eu/robots.txt | Valid for: http://www.müller.eu/, http://www.xn--mller-kva.eu/. Not valid for: http://www.muller.eu/ |
| ftp://example.com/robots.txt | Valid for: ftp://example.com/. Not valid for: http://example.com/. Google-specific: We use the robots.txt for FTP resources. |
| http://212.96.82.21/robots.txt | Valid for: http://212.96.82.21/. Not valid for: http://example.com/ (even if hosted on 212.96.82.21) |
| http://example.com:80/robots.txt | Valid for: http://example.com:80/, http://example.com/. Not valid for: http://example.com:81/ |
| http://example.com:8181/robots.txt | Valid for: http://example.com:8181/. Not valid for: http://example.com/ |
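As an illustration of this scoping rule, here is a minimal sketch in Python (using the standard urllib.parse module; the helper name `robots_txt_url` is hypothetical) that derives the robots.txt URL governing a given page URL:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL whose directives apply to page_url.

    Only the protocol, host and port matter; the path of the page URL
    is ignored, per the scoping rule described above.
    """
    parts = urlsplit(page_url)
    # Keep scheme and netloc (host plus optional explicit port) unchanged,
    # replace the path with /robots.txt, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# The two URLs below are governed by different robots.txt files,
# because the port number differs.
print(robots_txt_url("http://example.com/folder/file"))   # http://example.com/robots.txt
print(robots_txt_url("http://example.com:8181/folder"))   # http://example.com:8181/robots.txt
```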
Handling HTTP result codes
There are generally three different outcomes when robots.txt files are fetched:
- full allow: All content may be crawled.
- full disallow: No content may be crawled.
- conditional allow: The directives in the robots.txt determine the ability to crawl certain content.
| HTTP result / condition | Handling |
|---|---|
| 2xx (successful) | HTTP result codes that signal success result in a "conditional allow" of crawling. |
| 3xx (redirection) | Redirects will generally be followed until a valid result can be found (or a loop is recognized). We will follow a limited number of redirect hops (RFC 1945 for HTTP/1.0 allows up to 5 hops) and then stop and treat it as a 404. Handling of robots.txt redirects to disallowed URLs is undefined and discouraged. Handling of logical redirects for the robots.txt file based on HTML content that returns 2xx (frames, JavaScript, or meta refresh-type redirects) is undefined and discouraged. |
| 4xx (client errors) | Google treats all 4xx errors in the same way and assumes that no valid robots.txt file exists. It is assumed that there are no restrictions. This is a "full allow" for crawling. |
| 5xx (server error) | Server errors are seen as temporary errors that result in a "full disallow" of crawling. The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error will result in fairly frequent retrying. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code. Handling of a permanent server error is undefined. Google-specific: If we are able to determine that a site is incorrectly configured to return 5xx instead of a 404 for missing pages, we will treat a 5xx error from that site as a 404. |
| Unsuccessful requests or incomplete data | Handling of a robots.txt file which cannot be fetched due to DNS or networking issues such as timeouts, invalid responses, reset / hung up connections, HTTP chunking errors, etc. is undefined. |
| Caching | A robots.txt request is generally cached for up to one day, but may be cached longer in situations where refreshing the cached version is not possible (for example, due to timeouts or 5xx errors). The cached response may be shared by different crawlers. Google may increase or decrease the cache lifetime based on max-age Cache-Control HTTP headers. |
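As a rough companion to the table above, the following sketch (Python; the function name and outcome strings are illustrative, not part of any specification) maps the HTTP status of a robots.txt fetch to one of the three crawl outcomes:

```python
def robots_fetch_outcome(status_code: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl outcome.

    2xx: the file's directives decide ("conditional allow").
    4xx: treated as if no robots.txt exists ("full allow").
    5xx: temporary error, crawling is suspended ("full disallow").
    3xx redirects are assumed to have been followed already by the
    fetching layer, up to a limited number of hops.
    """
    if 200 <= status_code < 300:
        return "conditional allow"
    if 400 <= status_code < 500:
        return "full allow"
    if 500 <= status_code < 600:
        return "full disallow"
    return "undefined"  # e.g. unexpected codes or network-level failures
```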
File format
The expected file format is plain text encoded in UTF-8. The file consists of records (lines) separated by CR, CR/LF or LF.
Only valid records will be considered; all other content will be ignored. For example, if the resulting document is an HTML page, only valid text lines will be taken into account; the rest will be discarded without warning or error.
If a character encoding is used that produces characters which are not a subset of UTF-8, the contents of the file may be parsed incorrectly.
An optional Unicode BOM (byte order mark) at the beginning of the robots.txt file is ignored.
Each record consists of a field, a colon, and a value. Spaces are optional (but recommended to improve readability). Comments can be included at any location in the file using the "#" character; all content after the start of a comment until the end of the record is treated as a comment and ignored. The general format is `<field>:<value><#optional-comment>`. Whitespace at the beginning and at the end of the record is ignored.

The `<field>` element is case-insensitive. The `<value>` element may be case-sensitive, depending on the `<field>` element.

Handling of `<field>` elements with simple errors or typos (for example, "useragent" instead of "user-agent") is undefined and may be interpreted as correct directives by some user-agents.
A maximum file size may be enforced per crawler. Content beyond the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).
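A minimal sketch of this record syntax in Python (illustrative only, not Google's parser; it applies the BOM, comment, whitespace and size rules described above):

```python
import codecs

MAX_SIZE = 500 * 1024  # Google's documented limit of 500 kilobytes

def parse_records(raw: bytes):
    """Yield (field, value) tuples from a robots.txt payload.

    Strips an optional UTF-8 BOM, truncates at the size limit, splits
    records on CR, LF or CR/LF, removes "#" comments, trims surrounding
    whitespace and lower-cases the field. The value is left untouched,
    since it may be case-sensitive depending on the field.
    """
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw[:MAX_SIZE].decode("utf-8", errors="replace")
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comment and whitespace
        if ":" not in line:
            continue                          # not a valid record; ignored
        field, value = line.split(":", 1)
        yield field.strip().lower(), value.strip()
```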
Formal syntax / definition
This is a Backus-Naur Form (BNF)-like description, using the conventions of RFC 822, except that "|" is used to designate alternatives. Literals are quoted with "", parentheses "(" and ")" are used to group elements, optional elements are enclosed in [brackets], and elements may be preceded with <n>* to designate n or more repetitions of the following element; n defaults to 0.
```
robotstxt = *entries
entries = *( ( <1>*startgroupline
  *(groupmemberline | nongroupline | comment)
  | nongroupline
  | comment) )
startgroupline = [LWS] "user-agent" [LWS] ":" [LWS] agentvalue [comment] EOL
groupmemberline = [LWS] (
  pathmemberfield [LWS] ":" [LWS] pathvalue
  | othermemberfield [LWS] ":" [LWS] textvalue) [comment] EOL
nongroupline = [LWS] (
  urlnongroupfield [LWS] ":" [LWS] urlvalue
  | othernongroupfield [LWS] ":" [LWS] textvalue) [comment] EOL
comment = [LWS] "#" *anychar
agentvalue = textvalue

pathmemberfield = "disallow" | "allow"
othermemberfield = ()
urlnongroupfield = "sitemap"
othernongroupfield = ()

pathvalue = "/" path
urlvalue = absoluteURI
textvalue = *(valuechar | SP)
valuechar = <any UTF-8 character except ("#" CTL)>
anychar = <any UTF-8 character except CTL>
EOL = CR | LF | (CR LF)
```
The syntax for "absoluteURI", "CTL", "CR", "LF", and "LWS" is defined in RFC 1945. The syntax for "path" is defined in RFC 1808.
Grouping of records
Records are categorized into different types based on the type of `<field>` element:
- start-of-group
- group-member
- non-group
All group-member records after a start-of-group record up to the next start-of-group record are treated as a group of records. The only start-of-group field element is `user-agent`. Multiple start-of-group lines directly after each other share the group-member records that follow the final start-of-group line. Any group-member records without a preceding start-of-group record are ignored. All non-group records are valid independently of all groups.
Valid `<field>` elements, which will be individually detailed further on in this document, are:
- `user-agent` (start of group)
- `disallow` (only valid as a group-member record)
- `allow` (only valid as a group-member record)
- `sitemap` (non-group record)

All other `<field>` elements may be ignored.
The start-of-group element `user-agent` is used to specify for which crawler the group is valid. Only one group of records is valid for a particular crawler. We will cover order of precedence later in this document.
Example groups:
```
user-agent: a
disallow: /c

user-agent: b
disallow: /d

user-agent: e
user-agent: f
disallow: /g
```
There are three distinct groups specified, one for "a" and one for "b" as well as one for both "e" and "f". Each group has its own group-member record. Note the optional use of white-space (an empty line) to improve readability.
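Building on the parsing sketch earlier, grouping could be implemented roughly as follows (Python, illustrative only): consecutive `user-agent` records open a group, subsequent `allow`/`disallow` records attach to it, and `sitemap` records are collected outside of any group.

```python
def group_records(records):
    """Group (field, value) records as described above.

    Returns (groups, sitemaps), where each group is a pair of
    (list of user-agent values, list of (directive, path) rules).
    Consecutive user-agent lines share the rules that follow the last
    of them; group-member records with no preceding user-agent line
    are ignored, and sitemap records are independent of all groups.
    """
    groups, sitemaps = [], []
    current = None          # the group currently being filled
    in_user_agents = False  # True while consecutive user-agent lines are read
    for field, value in records:
        if field == "user-agent":
            if not in_user_agents:
                current = ([], [])
                groups.append(current)
            current[0].append(value.lower())
            in_user_agents = True
        elif field in ("allow", "disallow"):
            in_user_agents = False
            if current is not None and value:   # directives without a path are ignored
                current[1].append((field, value))
        elif field == "sitemap":
            in_user_agents = False
            sitemaps.append(value)              # non-group record
        else:
            in_user_agents = False              # unknown fields are ignored
    return groups, sitemaps
```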
Order of precedence for user-agents
Only one group of group-member records is valid for a particular crawler. The crawler must determine the correct group of records by finding the group with the most specific user-agent that still matches. All other groups of records are ignored by the crawler. The user-agent is case-insensitive. All non-matching text is ignored (for example, both `googlebot/1.2` and `googlebot*` are equivalent to `googlebot`). The order of the groups within the robots.txt file is irrelevant.
Example
Assuming the following robots.txt file:
```
user-agent: googlebot-news
(group 1)

user-agent: *
(group 2)

user-agent: googlebot
(group 3)
```
This is how the crawlers would choose the relevant group:
| Crawler | Record group followed |
|---|---|
| Googlebot News | The record group followed is group 1. Only the most specific group is followed; all others are ignored. |
| Googlebot (web) | The record group followed is group 3. |
| Googlebot Images | The record group followed is group 3. There is no specific `googlebot-images` group, so the more generic group is followed. |
| Googlebot News (when crawling images) | The record group followed is group 1. These images are crawled for and by Googlebot News, therefore only the Googlebot News group is followed. |
| Otherbot (web) | The record group followed is group 2. |
| Otherbot (News) | The record group followed is group 2. Even if there is an entry for a related crawler, it is only valid if it specifically matches. |
Also see Google's crawlers and user-agent strings.
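A sketch of this selection step in Python (illustrative; the handling of product tokens such as `googlebot/1.2` is simplified):

```python
def select_group(groups, crawler_name: str):
    """Pick the single most specific group that matches crawler_name.

    groups is the (user-agents, rules) structure from the grouping sketch.
    Matching is case-insensitive; a user-agent value that is a prefix of
    the crawler name is more specific the longer it is, and "*" is the
    least specific fallback.
    """
    crawler_name = crawler_name.lower()
    best, best_score = None, -1
    for agents, rules in groups:
        for agent in agents:
            # Ignore non-matching text such as version numbers or a
            # trailing wildcard: "googlebot/1.2" and "googlebot*" both
            # reduce to "googlebot".
            agent = agent.split("/")[0].rstrip("*")
            if agent == "*":
                score = 0
            elif crawler_name.startswith(agent):
                score = len(agent)
            else:
                continue
            if score > best_score:
                best, best_score = (agents, rules), score
    return best  # None if no group matches at all
```

With the three example groups above, `select_group(groups, "googlebot-news")` would return group 1, `select_group(groups, "googlebot-images")` group 3, and `select_group(groups, "otherbot")` group 2.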
Group-member records
Only general and Google-specific group-member record types are covered in this section. These record types are also called "directives" for the crawlers. These directives are specified in the form of `directive: [path]` where `[path]` is optional. By default, there are no restrictions for crawling for the designated crawlers. Directives without a `[path]` are ignored.
The `[path]` value, if specified, is to be seen relative to the root of the website for which the robots.txt file was fetched (using the same protocol, port number, host and domain names). The path value must start with "/" to designate the root. The path is case-sensitive. More information can be found in the section "URL matching based on path values" below.
disallow
The `disallow` directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.
Usage:
disallow: [path]
allow
The `allow` directive specifies paths that may be accessed by the designated crawlers. When no path is specified, the directive is ignored.
Usage:
allow: [path]
URL matching based on path values
The path value is used as a basis to determine whether or not a rule applies to a specific URL on a site. With the exception of wildcards, the path is used to match the beginning of a URL (and any valid URLs that start with the same path). Non-7-bit ASCII characters in a path may be included as UTF-8 characters or as percent-escaped UTF-8 encoded characters per RFC 3986.
Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:
- `*` designates 0 or more instances of any valid character.
- `$` designates the end of the URL.
| Example path | Matching behavior |
|---|---|
| / | Matches the root and any lower level URL. |
| /* | Equivalent to /. The trailing wildcard is ignored. |
| /fish | Matches: /fish, /fish.html, /fish/salmon.html, /fishheads, /fishheads/yummy.html, /fish.php?id=anything. Does not match: /Fish.asp, /catfish, /?id=fish |
| /fish* | Equivalent to /fish. The trailing wildcard is ignored. Matches: /fish, /fish.html, /fish/salmon.html, /fishheads, /fishheads/yummy.html, /fish.php?id=anything. Does not match: /Fish.asp, /catfish, /?id=fish |
| /fish/ | The trailing slash means this matches anything in this folder. Matches: /fish/, /fish/?id=anything, /fish/salmon.htm. Does not match: /fish, /fish.html, /Fish/Salmon.asp |
| /*.php | Matches: /filename.php, /folder/filename.php, /folder/filename.php?parameters, /folder/any.php.file.html, /filename.php/. Does not match: / (even if it maps to /index.php), /windows.PHP |
| /*.php$ | Matches: /filename.php, /folder/filename.php. Does not match: /filename.php?parameters, /filename.php/, /filename.php5, /windows.php |
| /fish*.php | Matches: /fish.php, /fishheads/catfish.php?parameters. Does not match: /Fish.PHP |
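One way to implement the matching shown in the table above, sketched in Python with the standard re module (an illustration, not the matcher Google actually uses):

```python
import re

def path_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern applies to a URL path.

    "*" matches zero or more characters, "$" anchors the match at the
    end of the URL, and everything else is compared literally (and
    case-sensitively) against the beginning of the path.
    """
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"          # end-of-URL anchor
        else:
            regex += re.escape(ch)
    return re.match(regex, path) is not None  # match() anchors at the start

# A few of the examples from the table above:
assert path_matches("/fish", "/fishheads/yummy.html")
assert not path_matches("/fish", "/catfish")
assert path_matches("/*.php$", "/filename.php")
assert not path_matches("/*.php$", "/filename.php?parameters")
```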
Google-supported non-group-member records
sitemap
Supported by Google, Ask, Bing, Yahoo; defined on sitemaps.org.
Usage:
sitemap: [absoluteURL]
`[absoluteURL]` points to a Sitemap, Sitemap Index file or equivalent URL. The URL does not have to be on the same host as the robots.txt file. Multiple `sitemap` entries may exist. As non-group-member records, these are not tied to any specific user-agent and may be followed by all crawlers, provided crawling of the Sitemap URL is not disallowed.
Order of precedence for group-member records
At a group-member level, in particular for `allow` and `disallow` directives, the most specific rule based on the length of the `[path]` entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.
| Sample URL | allow | disallow | Verdict |
|---|---|---|---|
| http://example.com/page | /p | / | allow |
| http://example.com/folder/page | /folder | /folder | allow |
| http://example.com/page.htm | /page | /*.htm | undefined |
| http://example.com/ | /$ | / | allow |
| http://example.com/page.htm | /$ | / | disallow |
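A sketch of this precedence rule in Python (illustrative, reusing the `path_matches` helper from the earlier sketch): among all `allow` and `disallow` rules whose path matches the URL, the one with the longest path wins, and `allow` wins a tie, which reproduces the verdicts in the table above for the non-wildcard cases.

```python
def is_allowed(rules, path: str) -> bool:
    """Decide whether a URL path may be crawled under a group's rules.

    rules is a list of ("allow" | "disallow", pattern) pairs. The matching
    rule with the longest pattern wins; if no rule matches, the default is
    to allow. Precedence among wildcard rules is undefined in the text
    above, so this sketch simply falls back to pattern length there too.
    """
    verdict, best_length = True, -1
    for directive, pattern in rules:
        if not path_matches(pattern, path):
            continue
        if len(pattern) > best_length:
            verdict, best_length = (directive == "allow"), len(pattern)
        elif len(pattern) == best_length:
            verdict = verdict or directive == "allow"  # allow wins a tie
    return verdict

# Second sample situation above: allow: /folder, disallow: /folder
# for http://example.com/folder/page -> allow.
print(is_allowed([("allow", "/folder"), ("disallow", "/folder")], "/folder/page"))  # True
```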