Monday, July 01, 2019
For 25 years, the Robots Exclusion Protocol (REP) has been one of the most basic and critical components of the web. It allows website owners to exclude automated clients, for example web crawlers, from accessing their sites - either partially or completely.
In 1994, Martijn Koster (a webmaster himself) created the initial standard after crawlers were overwhelming his site. With more input from other webmasters, the REP was born, and it was adopted by search engines to help website owners manage their server resources easier.
However, the REP was never turned into an official Internet standard, which means that developers have interpreted the protocol somewhat differently over the years. And since its inception, the REP hasn't been updated to cover today's corner cases. This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly.
We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers. Together with the original author of the protocol, webmasters, and other search engines, we've documented how the REP is used on the modern web, and submitted it to the IETF.
The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP. These fine grained controls give the publisher the power to decide what they'd like to be crawled on their site and potentially shown to interested users. It doesn't change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web. Notably:
- Any URI based transfer protocol can use robots.txt. For example, it's not limited to HTTP anymore and can be used for FTP or CoAP as well.
- Developers must parse at least the first 500 kibibytes of a robots.txt. Defining a maximum file size ensures that connections are not open for too long, alleviating unnecessary strain on servers.
- A new maximum caching time of 24 hours or cache directive value if available, gives website owners the flexibility to up