robots.txt Syntax Reference


This is the complete syntax reference for the robots.txt file format. Every directive, every pattern, every rule -- documented with examples you can copy and use immediately.

The robots.txt specification is defined in RFC 9309, published in September 2022. This reference covers the standard plus commonly supported extensions.

File Requirements

Before you write a single directive, the file itself must meet these requirements:

  • Filename: Exactly robots.txt (lowercase)
  • Location: Domain root, accessible at https://yourdomain.com/robots.txt
  • Encoding: UTF-8
  • Content type: text/plain
  • Maximum size: 500 KiB (per Google's implementation -- content beyond this limit is ignored)
  • Scope: Each protocol and subdomain combination needs its own file. https://example.com/robots.txt and https://blog.example.com/robots.txt are separate files with separate rules.
# This file must be at:
# https://yourdomain.com/robots.txt
# NOT https://yourdomain.com/pages/robots.txt
# NOT https://yourdomain.com/Robots.txt

File size matters

Google stops processing robots.txt files larger than 500 KiB. If your file exceeds this limit, everything after that point is ignored, and Google treats any unprocessed rules as allowed. Keep your file concise.

User-agent

The User-agent directive specifies which crawler the following rules apply to. It starts a new rule group.

Syntax: User-agent: <crawler-name>

# Target all crawlers
User-agent: *

# Target a specific crawler
User-agent: Googlebot

# Target multiple crawlers with the same rules
User-agent: Googlebot
User-agent: Bingbot
Disallow: /private/

Rules:

  • The value is case-insensitive. Googlebot, googlebot, and GOOGLEBOT all match the same crawler.
  • * is a wildcard that matches all crawlers.
  • If a crawler finds a group that matches its specific name, it uses that group and ignores the * group.
  • If no specific group matches, the crawler falls back to the * group.
  • If no * group exists and no specific group matches, everything is allowed.
  • Multiple User-agent lines can precede a set of rules, meaning those rules apply to all listed agents.
# Googlebot uses THIS block (specific match)
User-agent: Googlebot
Disallow: /google-only-blocked/

# All other crawlers use THIS block
User-agent: *
Disallow: /private/
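The group-selection rules above can be sketched in a few lines of Python. This is a simplification for illustration (real crawlers match product tokens, and a crawler like Googlebot-Image may fall back to a Googlebot group; this sketch only does exact, case-insensitive name matching plus the * fallback):

```python
def select_group(groups: dict[str, list[str]], crawler: str) -> list[str]:
    # User-agent values are case-insensitive: normalize names before comparing
    normalized = {name.lower(): rules for name, rules in groups.items()}
    # A specific match wins; otherwise fall back to the * group;
    # if neither exists, no rules apply (everything is allowed)
    return normalized.get(crawler.lower(), normalized.get("*", []))

groups = {
    "Googlebot": ["Disallow: /google-only-blocked/"],
    "*": ["Disallow: /private/"],
}
assert select_group(groups, "GOOGLEBOT") == ["Disallow: /google-only-blocked/"]
assert select_group(groups, "Bingbot") == ["Disallow: /private/"]
assert select_group({}, "Bingbot") == []
```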

Disallow

The Disallow directive tells a crawler not to access URLs matching the specified path.

Syntax: Disallow: <path>

User-agent: *
# Block a directory
Disallow: /admin/

# Block a specific page
Disallow: /secret-page.html

# Block everything
Disallow: /

# Block nothing (empty value = allow all)
Disallow:

Rules:

  • The path is case-sensitive. /Admin/ and /admin/ are different paths.
  • The path must start with /.
  • Disallow: with an empty value means nothing is disallowed (allow everything).
  • Disallow: / blocks the entire site for that user agent.
  • The path matches from the start of the URL path. Disallow: /admin blocks /admin, /admin/, /admin/settings, and /administration.
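Setting wildcards aside, the path matching described above is plain prefix matching. A minimal sketch:

```python
def matches(rule_path: str, url_path: str) -> bool:
    # A Disallow path matches any URL path that begins with it (case-sensitive)
    return url_path.startswith(rule_path)

# Disallow: /admin also catches /administration, because it is a prefix match
assert matches("/admin", "/administration")
assert matches("/admin", "/admin/settings")
# Adding the trailing slash narrows the rule to the directory
assert not matches("/admin/", "/administration")
```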

Test your Disallow rules

Paste your robots.txt and test specific URLs to see if they are blocked or allowed.

Allow

The Allow directive permits crawling of a URL that would otherwise be blocked by a Disallow rule. It is used to create exceptions within blocked directories.

Syntax: Allow: <path>

User-agent: *
Disallow: /account/
Allow: /account/login/

# /account/settings → blocked
# /account/login/ → allowed
# /account/login/reset → allowed

Rules:

  • Same path-matching rules as Disallow.
  • The path is case-sensitive.
  • Allow is defined in RFC 9309 and is supported by all major crawlers.
  • When Allow and Disallow rules conflict, the most specific rule (longest matching path) wins.
User-agent: *
Disallow: /docs/
Allow: /docs/public/
Disallow: /docs/public/drafts/

# /docs/ → blocked
# /docs/public/ → allowed
# /docs/public/guide.html → allowed
# /docs/public/drafts/ → blocked

Rule Matching and Precedence

When multiple rules match a URL, crawlers use the longest matching path to determine which rule applies. This is the standard defined in RFC 9309.

User-agent: *
Disallow: /
Allow: /public/
Disallow: /public/drafts/
Allow: /public/drafts/preview/

For the URL /public/drafts/preview/page.html:

  • Disallow: / matches (1 character)
  • Allow: /public/ matches (8 characters)
  • Disallow: /public/drafts/ matches (15 characters)
  • Allow: /public/drafts/preview/ matches (23 characters) -- this wins

The URL is allowed because the longest matching rule is an Allow.

When paths are equal length

If an Allow and Disallow rule match with the same path length, the Allow rule takes precedence per RFC 9309. In practice, avoid this situation by making your rules more specific.
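The precedence algorithm is straightforward to sketch: collect every matching rule, keep the one with the longest path, and let Allow win ties. This illustrative version omits wildcard handling:

```python
def is_allowed(url_path: str, rules: list[tuple[str, str]]) -> bool:
    # rules: ("allow" | "disallow", path); a URL with no matching rule is allowed
    best_kind, best_path = "allow", ""
    for kind, path in rules:
        if url_path.startswith(path):
            longer = len(path) > len(best_path)
            allow_wins_tie = len(path) == len(best_path) and kind == "allow"
            if longer or allow_wins_tie:
                best_kind, best_path = kind, path
    return best_kind == "allow"

rules = [
    ("disallow", "/"),
    ("allow", "/public/"),
    ("disallow", "/public/drafts/"),
    ("allow", "/public/drafts/preview/"),
]
# The 23-character Allow is the longest match, so this URL is allowed
assert is_allowed("/public/drafts/preview/page.html", rules)
# Here the 15-character Disallow is the longest match
assert not is_allowed("/public/drafts/plan.html", rules)
```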

Wildcards

The robots.txt specification supports two wildcard characters in Disallow and Allow paths.

The * wildcard

Matches any sequence of characters (including an empty sequence).

User-agent: *
# Block all PDF files
Disallow: /*.pdf

# Block all URLs containing "print"
Disallow: /*print*

# Block URLs with query parameters
Disallow: /*?

# Block all files in any "tmp" directory
Disallow: /*/tmp/

The $ end-of-URL marker

Anchors the match to the end of the URL. Without $, a path like Disallow: /*.pdf would also block /document.pdf.html. Adding $ makes it exact.

User-agent: *
# Block only URLs that END in .pdf
Disallow: /*.pdf$

# Block the exact path /about (not /about-us or /about/team)
Disallow: /about$

# Block URLs ending with a query parameter value
Disallow: /*&format=rss$

Comparison:

Disallow: /*.json     # blocks /data.json AND /data.json.bak
Disallow: /*.json$    # blocks /data.json but NOT /data.json.bak
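One way to reason about these patterns is to translate them into regular expressions: * becomes "any sequence of characters" and a trailing $ becomes an end-of-string anchor. A sketch (the helper name and approach are illustrative, not part of any spec):

```python
import re

def rule_to_regex(path: str) -> re.Pattern:
    # Escape regex metacharacters, then translate the two robots.txt wildcards
    pattern = re.escape(path).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        # A trailing $ anchors the rule to the end of the URL
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

assert rule_to_regex("/*.pdf$").match("/doc.pdf")
assert not rule_to_regex("/*.pdf$").match("/doc.pdf.html")
assert rule_to_regex("/*.pdf").match("/doc.pdf.html")
assert rule_to_regex("/*?").match("/page?sort=asc")
```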

Wildcard support varies

The * and $ wildcards are supported by Google, Bing, and most major crawlers. They are documented in Google's robots.txt specification but are technically outside the scope of RFC 9309. In practice, all crawlers you care about support them.

Sitemap

The Sitemap directive tells crawlers where to find your XML sitemap. Unlike other directives, it is not tied to any User-agent block.

Syntax: Sitemap: <absolute-url>

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
Sitemap: https://example.com/sitemap-products.xml

Rules:

  • The URL must be absolute (including https://).
  • You can list multiple Sitemap directives.
  • The directive can appear anywhere in the file -- top, bottom, or between groups.
  • The sitemap URL does not need to be on the same domain as the robots.txt file, though this is unusual.

Crawl-delay

The Crawl-delay directive requests that a crawler wait a specified number of seconds between requests.

Syntax: Crawl-delay: <seconds>

User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Crawl-delay: 5

Rules:

  • The value is in seconds.
  • Google ignores Crawl-delay entirely. Googlebot adjusts its crawl rate automatically based on how your server responds.
  • Bing, Yandex, and some other crawlers respect it.
  • This directive is not part of RFC 9309 -- it's a widely adopted extension.

Host

The Host directive specifies the preferred domain for a site (with or without www).

Syntax: Host: <domain>

User-agent: *
Disallow: /admin/

Host: www.example.com

Rules:

  • This directive was primarily used by Yandex, which has since deprecated it.
  • Google and Bing ignore it. Use canonical tags and 301 redirects for domain preference instead.
  • This directive is not part of RFC 9309.

Validate your robots.txt syntax

Paste your robots.txt and check for errors, warnings, and best-practice violations.

Comments

Lines starting with # are comments. Crawlers ignore them.

# This is a comment on its own line
User-agent: *  # This is an inline comment
Disallow: /admin/  # Block admin area

Rules:

  • Comments can appear on their own line or at the end of a directive line.
  • Everything after # on a line is ignored by the parser.
  • Use comments to document why rules exist. Future you will thank present you.
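A sketch of how a parser discards comments per the rules above, shown as a small helper for illustration:

```python
def strip_comment(line: str) -> str:
    # Everything from the first '#' to the end of the line is ignored
    return line.split("#", 1)[0].rstrip()

assert strip_comment("Disallow: /admin/  # Block admin area") == "Disallow: /admin/"
assert strip_comment("# This is a comment on its own line") == ""
```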

Whitespace and Formatting

# Blank lines separate groups
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/

# Spaces around the colon are optional but conventional
User-agent: *
Disallow: /admin/

# These are equivalent:
# Disallow: /admin/
# Disallow:/admin/
# Disallow:  /admin/

Rules:

  • Blank lines between groups are optional but improve readability.
  • Leading and trailing whitespace around values is trimmed.
  • Empty lines within a group may end that group (behavior varies by crawler). Best practice: keep all rules in a group together with no blank lines between them.
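The equivalence of those spacings falls out of how a directive line is parsed: split on the first colon, then trim whitespace on both sides. A minimal sketch:

```python
def parse_directive(line: str) -> tuple[str, str]:
    # Split on the first colon; whitespace around field and value is trimmed,
    # and directive names are case-insensitive
    field, _, value = line.partition(":")
    return field.strip().lower(), value.strip()

# All three spacings parse identically
assert parse_directive("Disallow: /admin/") == ("disallow", "/admin/")
assert parse_directive("Disallow:/admin/") == ("disallow", "/admin/")
assert parse_directive("Disallow:  /admin/") == ("disallow", "/admin/")
```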

Quick Reference Table

Directive      Supported by
-------------  ----------------------------
User-agent     All crawlers (RFC 9309)
Disallow       All crawlers (RFC 9309)
Allow          All crawlers (RFC 9309)
Sitemap        All major crawlers
Crawl-delay    Bing, Yandex (NOT Google)
Host           Yandex only
* wildcard     Google, Bing, most crawlers
$ end marker   Google, Bing, most crawlers

HTTP Status Codes and robots.txt

How crawlers handle different HTTP responses when fetching robots.txt:

  • 200 OK -- Rules are parsed and applied normally.
  • 3xx Redirect -- Crawlers follow redirects (up to a limit) and use the final response.
  • 4xx Client Error (including 404) -- No robots.txt exists. Crawlers assume everything is allowed.
  • 5xx Server Error -- Crawlers assume the site is temporarily unavailable, and Google will limit crawling. If 5xx errors persist, Google serves rules from its cached copy for up to 30 days; after that, it assumes there are no crawl restrictions.
# If /robots.txt returns 404, crawlers will:
# - Assume no restrictions exist
# - Crawl everything on the site
# This is why having a robots.txt file matters
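As a compact restatement, the fetch outcomes above map to crawler behavior roughly like this (a simplification for illustration; redirect limits and cache windows vary by crawler):

```python
def robots_fetch_behavior(status: int) -> str:
    # Maps the HTTP status of the robots.txt fetch to the typical crawler response
    if 200 <= status < 300:
        return "parse and apply rules"
    if 300 <= status < 400:
        return "follow the redirect"
    if 400 <= status < 500:
        return "treat everything as allowed"
    return "treat the site as temporarily unavailable"

assert robots_fetch_behavior(404) == "treat everything as allowed"
assert robots_fetch_behavior(503) == "treat the site as temporarily unavailable"
```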

The definitive robots.txt syntax reference. Bookmark it, share it, come back to it.

Test your robots.txt for free

Validate your robots.txt file instantly. Check directives, find crawling issues, and ensure search engines can access your site.