Useful robots.txt rules
Here are some common useful robots.txt rules:
**Disallow crawling of the entire site**

Keep in mind that in some situations URLs from the site may still be indexed, even if they haven't been crawled.

```
User-agent: *
Disallow: /
```
**Disallow crawling of a directory and its contents**

Append a forward slash to the directory name to disallow crawling of a whole directory.

```
User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/
```
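To sanity-check plain prefix rules like these before publishing them, you can evaluate them with Python's standard `urllib.robotparser`. This is a minimal sketch, assuming the rules above and a hypothetical crawler name; note that this module only understands simple path prefixes, not the `*` and `$` extensions used later in this list, and it is not Google's own parser:

```python
from urllib import robotparser

# The directory rules from the example above, parsed straight from a string.
rules = """\
User-agent: *
Disallow: /calendar/
Disallow: /junk/
Disallow: /books/fiction/contemporary/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# "MyCrawler" is a hypothetical user agent; it falls under the * group.
print(parser.can_fetch("MyCrawler", "https://example.com/calendar/2024.html"))  # False
print(parser.can_fetch("MyCrawler", "https://example.com/books/poetry/"))       # True
```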
**Allow access to a single crawler**

Only Googlebot-news may crawl the whole site.

```
User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
```
**Allow access to all but a single crawler**

Unnecessarybot may not crawl the site; all other bots may.

```
User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /
```
**Disallow crawling of a single web page**

For example, disallow the useless_file.html page and other_useless_file.html in the junk directory.

```
User-agent: *
Disallow: /useless_file.html
Disallow: /junk/other_useless_file.html
```
**Disallow crawling of the whole site except a subdirectory**

Crawlers may only access the public subdirectory.

```
User-agent: *
Disallow: /
Allow: /public/
```
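This pattern works because of rule precedence: when both an allow and a disallow rule match a URL, the more specific rule, meaning the one with the longer path, is the one applied, so `Allow: /public/` overrides `Disallow: /` for anything under `/public/`. The following is an illustrative sketch of that longest-match logic (not Google's implementation; it ignores wildcards and tie-breaking):

```python
# Rules from the example above: block everything except the public subdirectory.
RULES = [
    ("disallow", "/"),
    ("allow", "/public/"),
]

def is_allowed(path: str) -> bool:
    # Collect every rule whose path is a prefix of the requested path.
    matches = [(len(rule_path), kind) for kind, rule_path in RULES
               if path.startswith(rule_path)]
    if not matches:
        return True  # no rule matches, so crawling is allowed
    # The most specific (longest) matching path decides the outcome.
    _, kind = max(matches, key=lambda match: match[0])
    return kind == "allow"

print(is_allowed("/public/press/2024.html"))  # True:  Allow: /public/ is more specific
print(is_allowed("/private/report.html"))     # False: only Disallow: / matches
```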
**Block a specific image from Google Images**

For example, disallow the dogs.jpg image.

```
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
```
**Block all images on your site from Google Images**

Google can't index images and videos without crawling them.

```
User-agent: Googlebot-Image
Disallow: /
```
**Disallow crawling of files of a specific file type**

For example, disallow crawling of all .gif files.

```
User-agent: Googlebot
Disallow: /*.gif$
```
**Disallow crawling of an entire site, but allow Mediapartners-Google**

This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors on your site.

```
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
```
**Use the * and $ wildcards to match URLs that end with a specific string**

For example, disallow all .xls files.

```
User-agent: Googlebot
Disallow: /*.xls$
```
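In these patterns, `*` matches any sequence of characters and a trailing `$` anchors the rule to the end of the URL, so `/*.xls$` matches any URL whose path ends in .xls. Below is a minimal Python sketch of that matching behavior; the helper name `pattern_to_regex` is made up for this illustration, and it is not a full robots.txt matcher:

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(regex + ("$" if anchored else ""))

xls_rule = pattern_to_regex("/*.xls$")
print(bool(xls_rule.match("/reports/q1.xls")))       # True:  ends in .xls, so crawling is disallowed
print(bool(xls_rule.match("/reports/q1.xlsx")))      # False: the rule does not apply
print(bool(xls_rule.match("/reports/q1.xls?dl=1")))  # False: '$' requires the URL to end in .xls
```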