Tuesday, July 02, 2019
Yesterday we announced that we're open-sourcing Google's production robots.txt parser. It was an exciting moment that paves the way for potential Search open sourcing projects in the future! Feedback is helpful, and we're eagerly collecting questions from developers and webmasters alike. One question stood out, which we'll address in this post: Why isn't a code handler for other rules like crawl-delay included in the code?
The internet draft we published yesterday provides an extensible architecture for rules that are not part of the standard. This means that if a crawler wanted to support its own line, such as "unicorns: allowed", it could. To demonstrate how this would look in a parser, we included a very common line, sitemap, in our open-source robots.txt parser.
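As a rough illustration (a hypothetical sketch, not the actual interface of the released C++ library), a parser built around a line-handler abstraction might route known keys like sitemap to dedicated callbacks and send everything it doesn't recognize to a generic hook that individual crawlers can override, for example to honor a made-up unicorns rule:

    #include <iostream>
    #include <string>

    // Hypothetical handler interface: standard rules get dedicated callbacks,
    // while unrecognized keys go to a generic hook that a crawler can override
    // to support its own lines (e.g. "unicorns: allowed").
    class LineHandler {
     public:
      virtual ~LineHandler() = default;
      virtual void HandleSitemap(const std::string& value) {}
      virtual void HandleUnknownRule(const std::string& key,
                                     const std::string& value) {}
    };

    // Splits a single "key: value" line and dispatches it. A real parser also
    // strips comments, normalizes whitespace and case, and groups rules per
    // user-agent; this sketch leaves all of that out.
    void ParseLine(const std::string& line, LineHandler* handler) {
      const std::string::size_type colon = line.find(':');
      if (colon == std::string::npos) return;
      std::string key = line.substr(0, colon);
      std::string value = line.substr(colon + 1);
      value.erase(0, value.find_first_not_of(' '));  // trim leading spaces
      if (key == "sitemap" || key == "Sitemap") {
        handler->HandleSitemap(value);
      } else {
        handler->HandleUnknownRule(key, value);
      }
    }

    // A crawler that chooses to honor its own "unicorns" rule.
    class UnicornAwareHandler : public LineHandler {
     public:
      void HandleUnknownRule(const std::string& key,
                             const std::string& value) override {
        if (key == "unicorns") {
          std::cout << "unicorns: " << value << "\n";
        }
      }
    };

    int main() {
      UnicornAwareHandler handler;
      ParseLine("Sitemap: https://example.com/sitemap.xml", &handler);
      ParseLine("unicorns: allowed", &handler);  // prints "unicorns: allowed"
      return 0;
    }

The point of such a design is that the core parser stays agnostic: supporting a new, non-standard line is a matter of overriding one hook rather than changing the parser itself.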
While open-sourcing our parser library, we analyzed the usage of robots.txt rules. In particular, we focused on rules unsupported by the internet draft, such as crawl-delay, nofollow, and noindex. Since these rules were never documented by Google, naturally, their usage in relation to Googlebot is very low. Digging further, we saw their usage was contradicted by other rules in all but 0.001% of all robots.txt files on the internet.
These mistakes hurt websites' presence in Google's search results in ways we don't think
webmasters intended.
In the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we're retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019. For those of you who relied on the noindex indexing rule in the robots.txt file, which controls crawling, there are a number of alternative options:
- noindex in robots meta tags: Supported both in the HTTP response headers and in HTML, the noindex directive is the most effective way to remove URLs from the index when crawling is allowed.