robots.txt Syntax Reference
This is the complete syntax reference for the robots.txt file format. Every directive, every pattern, every rule -- documented with examples you can copy and use immediately.
The robots.txt specification is defined in RFC 9309, published in September 2022. This reference covers the standard plus commonly supported extensions.
File Requirements
Before you write a single directive, the file itself must meet these requirements:
- Filename: exactly robots.txt (lowercase)
- Location: domain root, accessible at https://yourdomain.com/robots.txt
- Encoding: UTF-8
- Content type: text/plain
- Maximum size: 500 KiB (per Google's implementation -- content beyond this limit is ignored)
- Protocol and host: each subdomain and each protocol needs its own file. https://example.com/robots.txt and https://blog.example.com/robots.txt are separate files with separate rules.
# This file must be at:
# https://yourdomain.com/robots.txt
# NOT https://yourdomain.com/pages/robots.txt
# NOT https://yourdomain.com/Robots.txt
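The root-location rule is mechanical, so it can be checked programmatically. A minimal sketch (the helper name robots_url is mine, not part of any library):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL.

    robots.txt always lives at the root of the same scheme + host,
    so everything after the authority is discarded.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/2024/post.html"))
# → https://blog.example.com/robots.txt
```

Note that the host is preserved exactly: a page on blog.example.com is governed by blog.example.com/robots.txt, never by the apex domain's file.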
File size matters
Google stops processing robots.txt files larger than 500 KiB. If your file exceeds this limit, everything after that mark is ignored, and rules Google never read cannot block anything -- the URLs they targeted are treated as allowed. Keep your file concise.
User-agent
The User-agent directive specifies which crawler the following rules apply to. It starts a new rule group.
Syntax: User-agent: <crawler-name>
# Target all crawlers
User-agent: *
# Target a specific crawler
User-agent: Googlebot
# Target multiple crawlers with the same rules
User-agent: Googlebot
User-agent: Bingbot
Disallow: /private/
Rules:
- The value is case-insensitive. Googlebot, googlebot, and GOOGLEBOT all match the same crawler.
- * is a wildcard that matches all crawlers.
- If a crawler finds a group that matches its specific name, it uses that group and ignores the * group.
- If no specific group matches, the crawler falls back to the * group.
- If no * group exists and no specific group matches, everything is allowed.
- Multiple User-agent lines can precede a set of rules, meaning those rules apply to all listed agents.
# Googlebot uses THIS block (specific match)
User-agent: Googlebot
Disallow: /google-only-blocked/
# All other crawlers use THIS block
User-agent: *
Disallow: /private/
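The group-selection logic above can be sketched in a few lines of Python. This is a simplified illustration, not a full RFC 9309 parser; select_group is my name, and group names are assumed already lowercased:

```python
def select_group(groups: dict, crawler: str):
    """Pick the rule group a crawler should obey.

    groups maps a lowercased user-agent token (or "*") to its rules.
    A group matching the crawler's own name wins; otherwise fall back
    to the "*" group; otherwise no group applies (everything allowed).
    """
    name = crawler.lower()
    if name in groups:
        return groups[name]   # specific match beats the wildcard group
    return groups.get("*")    # may be None: no restrictions at all

groups = {
    "googlebot": ["Disallow: /google-only-blocked/"],
    "*": ["Disallow: /private/"],
}
print(select_group(groups, "Googlebot"))    # → ['Disallow: /google-only-blocked/']
print(select_group(groups, "DuckDuckBot"))  # → ['Disallow: /private/']
```

The key point the sketch captures: Googlebot never sees the * group's rules once a Googlebot-specific group exists.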
Disallow
The Disallow directive tells a crawler not to access URLs matching the specified path.
Syntax: Disallow: <path>
User-agent: *
# Block a directory
Disallow: /admin/
# Block a specific page
Disallow: /secret-page.html
# Block everything
Disallow: /
# Block nothing (empty value = allow all)
Disallow:
Rules:
- The path is case-sensitive. /Admin/ and /admin/ are different paths.
- The path must start with /.
- Disallow: with an empty value means nothing is disallowed (allow everything).
- Disallow: / blocks the entire site for that user agent.
- The path matches from the start of the URL path. Disallow: /admin blocks /admin, /admin/, /admin/settings, and /administration.
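For wildcard-free paths, this matching is just a case-sensitive prefix test. A sketch (path_blocked is an illustrative helper, not a library function):

```python
def path_blocked(url_path: str, disallow: str) -> bool:
    """True if url_path matches a wildcard-free Disallow value.

    Matching is a case-sensitive prefix test from the start of the
    path; an empty Disallow value never matches anything.
    """
    return bool(disallow) and url_path.startswith(disallow)

for path in ("/admin", "/admin/settings", "/administration", "/Admin/"):
    print(path, path_blocked(path, "/admin"))
# → /admin True, /admin/settings True, /administration True, /Admin/ False
```

The /administration result surprises people: to block only the directory, end the rule with a slash (Disallow: /admin/).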
Allow
The Allow directive permits crawling of a URL that would otherwise be blocked by a Disallow rule. It is used to create exceptions within blocked directories.
Syntax: Allow: <path>
User-agent: *
Disallow: /account/
Allow: /account/login/
# /account/settings → blocked
# /account/login/ → allowed
# /account/login/reset → allowed
Rules:
- Same path-matching rules as Disallow.
- The path is case-sensitive.
- Allow is defined in RFC 9309 and is supported by all major crawlers.
- When Allow and Disallow rules conflict, the most specific rule (longest matching path) wins.
User-agent: *
Disallow: /docs/
Allow: /docs/public/
Disallow: /docs/public/drafts/
# /docs/ → blocked
# /docs/public/ → allowed
# /docs/public/guide.html → allowed
# /docs/public/drafts/ → blocked
Rule Matching and Precedence
When multiple rules match a URL, crawlers use the longest matching path to determine which rule applies. This is the standard defined in RFC 9309.
User-agent: *
Disallow: /
Allow: /public/
Disallow: /public/drafts/
Allow: /public/drafts/preview/
For the URL /public/drafts/preview/page.html:
- Disallow: / matches (1 character)
- Allow: /public/ matches (8 characters)
- Disallow: /public/drafts/ matches (15 characters)
- Allow: /public/drafts/preview/ matches (23 characters) -- this wins
The URL is allowed because the longest matching rule is an Allow.
When paths are equal length
If an Allow and Disallow rule match with the same path length, the Allow rule takes precedence per RFC 9309. In practice, avoid this situation by making your rules more specific.
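The longest-match rule, including the equal-length tiebreak in Allow's favor, fits in a short function. A sketch for wildcard-free paths only (is_allowed is my name for it):

```python
def is_allowed(path: str, rules: list) -> bool:
    """Decide access for a path given (directive, value) pairs.

    Among all rules whose value is a prefix of the path, the longest
    value wins; on a length tie, "allow" beats "disallow" per RFC 9309.
    No matching rule at all means the path is allowed.
    """
    matches = [(len(value), directive == "allow")
               for directive, value in rules
               if value and path.startswith(value)]
    if not matches:
        return True
    # max() compares length first, then the boolean, so True ("allow")
    # outranks False ("disallow") when lengths are equal.
    return max(matches)[1]

rules = [
    ("disallow", "/"),
    ("allow", "/public/"),
    ("disallow", "/public/drafts/"),
    ("allow", "/public/drafts/preview/"),
]
print(is_allowed("/public/drafts/preview/page.html", rules))  # → True
print(is_allowed("/public/drafts/notes.html", rules))         # → False
```

Running it against the example file above reproduces the walkthrough: the 23-character Allow outranks every shorter match.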
Wildcards
The robots.txt specification supports two wildcard characters in Disallow and Allow paths.
The * wildcard
Matches any sequence of characters (including an empty sequence).
User-agent: *
# Block all PDF files
Disallow: /*.pdf
# Block all URLs containing "print"
Disallow: /*print*
# Block URLs with query parameters
Disallow: /*?
# Block all files in any "tmp" directory
Disallow: /*/tmp/
The $ end-of-URL marker
Anchors the match to the end of the URL. Without $, a path like Disallow: /*.pdf would also block /document.pdf.html. Adding $ makes it exact.
User-agent: *
# Block only URLs that END in .pdf
Disallow: /*.pdf$
# Block the exact path /about (not /about-us or /about/team)
Disallow: /about$
# Block URLs ending with a query parameter value
Disallow: /*&format=rss$
Comparison:
Disallow: /*.json # blocks /data.json AND /data.json.bak
Disallow: /*.json$ # blocks /data.json but NOT /data.json.bak
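These patterns map cleanly onto regular expressions: * becomes .*, a trailing $ becomes an end anchor, and everything else is matched literally. A sketch of the translation (pattern_matches is illustrative, not a library function):

```python
import re

def pattern_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt path pattern against a URL path.

    "*" becomes ".*", a trailing "$" anchors the end of the URL, and
    all other characters are escaped so they match literally. Patterns
    without "$" get no end anchor, so they match any URL that begins
    with the pattern.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(pattern_matches("/*.json", "/data.json.bak"))   # → True
print(pattern_matches("/*.json$", "/data.json.bak"))  # → False
print(pattern_matches("/*.json$", "/data.json"))      # → True
```

This reproduces the comparison above and the /about$ example: without the anchor, a pattern keeps matching past its last literal character.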
Wildcard support varies
The * and $ wildcards are supported by Google, Bing, and most major crawlers. They are documented in Google's robots.txt specification but are technically outside the scope of RFC 9309. In practice, all crawlers you care about support them.
Sitemap
The Sitemap directive tells crawlers where to find your XML sitemap. Unlike other directives, it is not tied to any User-agent block.
Syntax: Sitemap: <absolute-url>
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
Sitemap: https://example.com/sitemap-products.xml
Rules:
- The URL must be absolute (including https://).
- You can list multiple Sitemap directives.
- The directive can appear anywhere in the file -- top, bottom, or between groups.
- The sitemap URL does not need to be on the same domain as the robots.txt file, though this is unusual.
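Python's standard-library parser extracts Sitemap lines for you: RobotFileParser.site_maps() (available since Python 3.8) returns every sitemap URL in the file, regardless of which group it sits near:

```python
from urllib.robotparser import RobotFileParser

robots = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
"""

# parse() accepts the file as an iterable of lines, so no HTTP
# fetch is needed for testing.
parser = RobotFileParser()
parser.parse(robots.splitlines())
print(parser.site_maps())
# → ['https://example.com/sitemap.xml', 'https://example.com/sitemap-blog.xml']
```

site_maps() returns None rather than an empty list when the file contains no Sitemap directives.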
Crawl-delay
The Crawl-delay directive requests that a crawler wait a specified number of seconds between requests.
Syntax: Crawl-delay: <seconds>
User-agent: Bingbot
Crawl-delay: 10
User-agent: *
Crawl-delay: 5
Rules:
- The value is in seconds.
- Google ignores Crawl-delay entirely, and Search Console's manual crawl-rate setting was retired in early 2024, so there is no way to set a fixed delay for Googlebot.
- Bing, Yandex, and some other crawlers respect it.
- This directive is not part of RFC 9309 -- it's a widely adopted extension.
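Despite being an extension, Crawl-delay is understood by Python's standard-library parser: RobotFileParser.crawl_delay() (Python 3.6+) returns the value from the best-matching group, or None if none is set:

```python
from urllib.robotparser import RobotFileParser

robots = """\
User-agent: Bingbot
Crawl-delay: 10

User-agent: *
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots.splitlines())
print(parser.crawl_delay("Bingbot"))       # → 10
print(parser.crawl_delay("SomeOtherBot"))  # → 5
```

An unknown bot falls through to the * group's value, mirroring the User-agent fallback rules described earlier.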
Host
The Host directive specifies the preferred domain for a site (with or without www).
Syntax: Host: <domain>
User-agent: *
Disallow: /admin/
Host: www.example.com
Rules:
- This directive was primarily used by Yandex.
- Google and Bing ignore it. Use canonical tags and 301 redirects for domain preference instead.
- This directive is not part of RFC 9309.
Comments
Lines starting with # are comments. Crawlers ignore them.
# This is a comment on its own line
User-agent: * # This is an inline comment
Disallow: /admin/ # Block admin area
Rules:
- Comments can appear on their own line or at the end of a directive line.
- Everything after
#on a line is ignored by the parser. - Use comments to document why rules exist. Future you will thank present you.
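Comment handling is the first step of any robots.txt parser and amounts to one transformation per line. A sketch (strip_comment is an illustrative name):

```python
def strip_comment(line: str) -> str:
    """Drop everything from the first '#' onward, then trim whitespace."""
    return line.split("#", 1)[0].strip()

print(strip_comment("Disallow: /admin/  # Block admin area"))
# → Disallow: /admin/
print(strip_comment("# a full-line comment"))
# → (empty string: the line is skipped entirely)
```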
Whitespace and Formatting
# Blank lines separate groups
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /private/
# Spaces around the colon are optional but conventional
User-agent: *
Disallow: /admin/
# These are equivalent:
# Disallow: /admin/
# Disallow:/admin/
# Disallow: /admin/
Rules:
- Blank lines between groups are optional but improve readability.
- Leading and trailing whitespace around values is trimmed.
- Empty lines within a group may end that group (behavior varies by crawler). Best practice: keep all rules in a group together with no blank lines between them.
Quick Reference Table
| Directive | Supported By |
|---|---|
| User-agent | All crawlers (RFC 9309) |
| Disallow | All crawlers (RFC 9309) |
| Allow | All crawlers (RFC 9309) |
| Sitemap | All major crawlers |
| Crawl-delay | Bing, Yandex (NOT Google) |
| Host | Yandex only (deprecated) |
| * wildcard | Google, Bing, most crawlers |
| $ end marker | Google, Bing, most crawlers |
HTTP Status Codes and robots.txt
How crawlers handle different HTTP responses when fetching robots.txt:
- 200 OK -- Rules are parsed and applied normally.
- 3xx Redirect -- Crawlers follow redirects (up to a limit) and use the final response.
- 4xx Client Error (including 404) -- No robots.txt exists. Crawlers assume everything is allowed.
- 5xx Server Error -- Crawlers assume the site is temporarily unavailable. Google will limit crawling. If 5xx errors persist for 30+ days, Google uses the last cached version, then treats it as fully allowed.
# If /robots.txt returns 404, crawlers will:
# - Assume no restrictions exist
# - Crawl everything on the site
# This is why having a robots.txt file matters
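The fetch behavior above can be captured as a small decision function. This is a simplification of the documented handling (it ignores the 30-day caching nuance for 5xx), and the names are mine:

```python
def robots_fetch_policy(status: int) -> str:
    """Map a robots.txt HTTP status to a simplified crawl policy."""
    if 200 <= status < 300:
        return "parse"           # apply the rules in the response body
    if 300 <= status < 400:
        return "follow"          # follow the redirect (up to a limit)
    if 400 <= status < 500:
        return "allow-all"       # no robots.txt: everything is crawlable
    return "assume-unavailable"  # 5xx: back off and retry later

print(robots_fetch_policy(404))  # → allow-all
print(robots_fetch_policy(503))  # → assume-unavailable
```

The 4xx branch is the one that trips people up: returning 403 for robots.txt does not protect the site; it tells crawlers nothing is restricted.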