robots.txt

Mikes Notes

The robots.txt file is placed at the root of a website, e.g.

  • www.example.com/robots.txt
  • https://example.com:8181/robots.txt
  • ftp://example.com/robots.txt
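
A crawler derives the robots.txt location by dropping the path from whatever URL it is about to fetch and appending /robots.txt. A minimal Python sketch (the page URLs are just placeholders):

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # robots.txt always lives at the root of the scheme/host/port,
    # no matter how deep the original URL is.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/shop/item?id=42"))
# https://www.example.com/robots.txt
print(robots_url("https://example.com:8181/docs/index.html"))
# https://example.com:8181/robots.txt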

Description

"robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

The standard, developed in 1994, relies on voluntary compliance. Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity. Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload. In the 2020s many websites began denying bots that collect information for generative artificial intelligence.

The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites." - Wikipedia

Maximum size of a robots.txt file

The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (KiB) of a robots.txt file, and Google enforces this as a 500 KiB size limit on robots.txt files.
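
One way a crawler might honour that limit is to download at most 500 KiB of the file and ignore anything beyond it. A rough sketch, not a production fetcher:

import urllib.request

MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB, the minimum crawlers must parse

def fetch_robots_txt(url):
    # Read only the first 500 KiB; any rules beyond the limit are ignored.
    with urllib.request.urlopen(url) as resp:
        data = resp.read(MAX_ROBOTS_BYTES)
    return data.decode("utf-8", errors="replace")

rules = fetch_robots_txt("https://example.com/robots.txt")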

Examples

The wildcard * stands for all robots.
This example tells all robots to stay out of a website.
User-agent: *
Disallow: /
This example tells all robots they can visit all files.
User-agent: *
Allow: /
This example tells all robots not to enter three directories.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
This example tells all robots to stay away from one specific file.
User-agent: *
Disallow: /directory/file.html
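
These rules can be sanity-checked without a web server by feeding them straight to Python's urllib.robotparser. For instance, using the three-directory example above:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/tmp/session.txt"))  # False
print(rp.can_fetch("*", "https://example.com/about.html"))       # True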
If you don't want crawlers to access sections of your site, you can create a robots.txt file with appropriate rules. A robots.txt file is a simple text file containing rules about which crawlers may access which parts of a site. For example, the robots.txt file for example.com may look like this:
# This robots.txt file controls crawling of URLs under https://example.com.
# All crawlers are disallowed to crawl files in the "includes" directory, such
# as .css, .js, but Google needs them for rendering, so Googlebot is allowed
# to crawl them.
User-agent: *
Disallow: /includes/

User-agent: Googlebot
Allow: /includes/

Sitemap: https://example.com/sitemap.xml
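
The same parser can confirm the per-agent behaviour of this file: Googlebot is allowed into /includes/ while every other crawler is kept out ("OtherBot" below is just a made-up user agent for illustration):

from urllib.robotparser import RobotFileParser

robots = """\
User-agent: *
Disallow: /includes/

User-agent: Googlebot
Allow: /includes/
"""

rp = RobotFileParser()
rp.parse(robots.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/includes/app.js"))  # True
print(rp.can_fetch("OtherBot", "https://example.com/includes/app.js"))   # False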
