How to Read and Understand robots.txt

Learn to read any robots.txt file. Understand User-agent, Disallow, Allow, Sitemap, and Crawl-delay directives with real-world examples.

Every public website has (or should have) a robots.txt file. When you are debugging crawl issues, auditing a competitor, or just trying to understand how a site controls search engine access, you need to be able to read this file quickly and accurately.

This guide breaks down every directive you will encounter, explains the matching rules crawlers use, and walks through real-world patterns.

How to Find Any Site's robots.txt

The robots.txt file always lives at the root of a domain. To view it, append /robots.txt to the domain:

https://example.com/robots.txt
https://blog.example.com/robots.txt
https://shop.example.com/robots.txt

Each subdomain has its own robots.txt. The rules for example.com do not apply to blog.example.com. If a subdomain does not have a robots.txt file (you get a 404), crawlers treat it as having no restrictions.

Open the URL in your browser and you will see a plain text file. No special tools needed.
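Because the location is fixed at the root, you can derive the robots.txt URL from any page URL. A minimal Python sketch (the helper name `robots_url` is just for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """robots.txt always lives at the root of the scheme + host,
    regardless of which path you start from."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/2024/hello"))
# https://blog.example.com/robots.txt
```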

Understanding User-agent Blocks

A robots.txt file is organized into blocks. Each block starts with a User-agent line that specifies which crawler the following rules apply to.

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

User-agent: Bingbot
Disallow: /private/

User-agent: *
Disallow: /tmp/

Here is how to read this:

  • Googlebot cannot access /admin/ except for /admin/public/
  • Bingbot cannot access /private/ but can access everything else
  • All other crawlers (the * wildcard) cannot access /tmp/

The critical rule: a crawler looks for a block that matches its name. If it finds one, it follows those rules and ignores the * block entirely. If no specific block matches, it falls back to the * block.

So in the example above, Googlebot does not follow the Disallow: /tmp/ rule because it has its own dedicated block.
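Python's standard library can demonstrate this block-selection rule on the example above. One caveat: `urllib.robotparser` implements the original first-match semantics and does not expand `*` wildcards inside paths, so it can disagree with Google's longest-match behavior on complex files; the cases below are unambiguous:

```python
from urllib import robotparser

rules = """\
User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

User-agent: Bingbot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot has a dedicated block, so the * block's /tmp/ rule does not apply.
print(rp.can_fetch("Googlebot", "https://example.com/tmp/cache"))     # True
print(rp.can_fetch("Googlebot", "https://example.com/admin/users"))   # False
# A crawler with no dedicated block falls back to the * block.
print(rp.can_fetch("SomeOtherBot", "https://example.com/tmp/cache"))  # False
```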

The Disallow Directive

Disallow tells a crawler not to access a URL path. Matching is by prefix: a rule applies to any URL whose path begins with the rule's value.

Disallow: /admin/      # Blocks /admin/, /admin/dashboard, /admin/users
Disallow: /search      # Blocks /search, /search?q=test, /searching
Disallow: /            # Blocks everything
Disallow:              # Blocks nothing (empty value = allow all)

Empty Disallow is not the same as no Disallow

Disallow: (with no path) explicitly allows everything. If a User-agent block has no Disallow directive at all, the result is the same -- everything is allowed. But an empty Disallow: is the conventional way to say "this crawler can access everything."

Pay attention to trailing slashes. Disallow: /admin matches any URL path starting with /admin, including /administrator and /admin-panel. Disallow: /admin/ only matches paths starting with /admin/.
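Ignoring wildcards, the matching rule is simple prefix comparison. A sketch that makes the trailing-slash difference concrete:

```python
# Prefix matching only -- wildcards (* and $) are deliberately left out here.
def is_blocked(path: str, disallow: str) -> bool:
    """True if a single Disallow value matches the given URL path."""
    return bool(disallow) and path.startswith(disallow)

assert is_blocked("/administrator", "/admin")       # /admin over-matches
assert not is_blocked("/administrator", "/admin/")  # /admin/ is stricter
assert not is_blocked("/anything", "")              # empty Disallow blocks nothing
```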

The Allow Directive

Allow overrides a Disallow for a more specific path. It is most useful when you want to block a directory but permit certain files or subdirectories within it.

User-agent: *
Disallow: /private/
Allow: /private/press-releases/

This blocks /private/ but allows /private/press-releases/ and anything under it.

When an Allow and Disallow rule are equally specific (same path length), the Allow wins. When they differ in specificity, the more specific rule (the one with the longer path) takes precedence.

User-agent: Googlebot
Disallow: /downloads/
Allow: /downloads/public/
Disallow: /downloads/public/internal/

For the URL /downloads/public/internal/report.pdf:

  • Disallow: /downloads/ matches (length 11)
  • Allow: /downloads/public/ matches (length 18)
  • Disallow: /downloads/public/internal/ matches (length 27)

The most specific rule wins, so this URL is blocked.
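The precedence rule is easy to implement. A minimal, wildcard-free sketch of the longest-match logic described above (the rule tuples are my own representation, not a standard format):

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: ("Allow" | "Disallow", path-prefix) pairs from one User-agent block.
    The longest matching prefix wins; Allow wins an exact tie."""
    matches = [(len(p), d) for d, p in rules if p and path.startswith(p)]
    if not matches:
        return True  # no rule matches: allowed by default
    _, directive = max(matches, key=lambda m: (m[0], m[1] == "Allow"))
    return directive == "Allow"

rules = [
    ("Disallow", "/downloads/"),                  # length 11
    ("Allow",    "/downloads/public/"),           # length 18
    ("Disallow", "/downloads/public/internal/"),  # length 27
]
print(is_allowed("/downloads/public/internal/report.pdf", rules))  # False
print(is_allowed("/downloads/public/report.pdf", rules))           # True
```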

The Sitemap Directive

The Sitemap directive tells crawlers where to find your XML sitemap. Unlike other directives, it is not tied to any User-agent block. Place it anywhere in the file -- typically at the top or bottom.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

You can list multiple sitemaps. The URL must be absolute (including the protocol and domain). Relative paths like Sitemap: /sitemap.xml are not valid.
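Since Python 3.8, the standard library parser surfaces these lines directly, which is handy when auditing a file programmatically:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml

User-agent: *
Disallow: /tmp/
""".splitlines())

print(rp.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-posts.xml']
```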

The Crawl-delay Directive

Crawl-delay tells a crawler to wait a specified number of seconds between requests. Not all crawlers respect it.

User-agent: *
Crawl-delay: 10

This asks crawlers to wait 10 seconds between requests. Googlebot ignores this directive entirely and manages its crawl rate automatically (the manual crawl-rate setting in Google Search Console has since been retired). Bingbot and some other crawlers do respect it.
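For crawlers that do honor it, the standard library exposes the value (Python 3.6+), which a polite crawler can feed into a sleep between requests:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Crawl-delay: 10
""".splitlines())

delay = rp.crawl_delay("MyCrawler") or 0  # None when no Crawl-delay applies
print(delay)  # 10

# A polite fetch loop would then pause between requests:
#     fetch(url)
#     time.sleep(delay)
```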

Wildcards: * and $

The robots.txt standard supports two pattern-matching characters:

The * character matches any sequence of characters (including an empty sequence).

Disallow: /*.pdf$       # Blocks any URL ending in .pdf
Disallow: /archive/*/   # Blocks paths like /archive/2023/, /archive/old/
Disallow: /*?           # Blocks any URL containing a query string

The $ character marks the end of a URL. Without it, a pattern matches any URL that starts with the given path. With it, the URL must end at that point.

Disallow: /news$    # Blocks /news but NOT /news/ or /news/article
Disallow: /*.php$   # Blocks any URL ending in .php
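These two characters map naturally onto regular expressions: * becomes .* and a trailing $ becomes an end anchor. A sketch of the translation (real matchers also normalize percent-encoding, which is skipped here):

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Convert a robots.txt path pattern into an anchored-at-start regex."""
    anchored = pattern.endswith("$")
    body = re.escape(pattern.rstrip("$")).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

pdf_rule = pattern_to_regex("/*.pdf$")
assert pdf_rule.match("/files/report.pdf")
assert not pdf_rule.match("/files/report.pdf?download=1")

news_rule = pattern_to_regex("/news$")
assert news_rule.match("/news")
assert not news_rule.match("/news/article")
```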

Common Patterns in the Wild

Here are patterns you will see frequently when reading robots.txt files across the web.

Blocking search result pages:

Disallow: /search
Disallow: /*?s=
Disallow: /*?q=

Sites do this to prevent search engines from indexing their own internal search results, which are low-quality pages that duplicate existing content.

Blocking user-generated content paths:

Disallow: /profile/*/drafts
Disallow: /user/*/settings

This protects user-specific pages that should not appear in search results.

Blocking duplicate content from filters and sorting:

Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=

E-commerce sites frequently block filtered and paginated URLs to avoid thousands of near-duplicate pages in the index.

Blocking development and staging artifacts:

Disallow: /api/
Disallow: /_next/
Disallow: /wp-json/
Disallow: /debug/

Reading Multi-Block Files

Large sites often have complex robots.txt files with many User-agent blocks. Here is how to read them efficiently:

  1. Find the relevant User-agent block. If you are debugging Googlebot behavior, look for User-agent: Googlebot or User-agent: Googlebot-Image first. If none exists, fall back to User-agent: *.

  2. Read only that block. Ignore rules in other blocks -- they do not apply to the crawler you are investigating.

  3. Check for the most specific match. If a URL matches both a Disallow and an Allow rule, the longer (more specific) pattern wins.

  4. Look at the Sitemap directives. These apply globally regardless of which User-agent block they appear near.

  5. Check for comments. Good robots.txt files include # comments explaining why certain rules exist. These give you context about the site's intent.
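The first three steps can be sketched in code. This is a simplified reader for illustration only (comments, wildcards, and percent-encoding are ignored, and the function names are my own):

```python
def applicable_rules(robots_txt: str, agent: str) -> list[tuple[str, str]]:
    """Steps 1-2: return the (directive, path) rules from the block that
    matches `agent`, falling back to the * block."""
    blocks, agents, rules = {}, [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#")[0].strip()   # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if rules:                      # previous block ended
                for a in agents:
                    blocks[a] = rules
                agents, rules = [], []
            agents.append(value.lower())
        elif key in ("allow", "disallow"):
            rules.append((key, value))
    for a in agents:
        blocks[a] = rules
    return blocks.get(agent.lower(), blocks.get("*", []))

def is_blocked(robots_txt: str, agent: str, path: str) -> bool:
    """Step 3: within the chosen block, the longest matching prefix wins,
    and Allow wins an exact tie."""
    rules = applicable_rules(robots_txt, agent)
    matches = [(len(p), d) for d, p in rules if p and path.startswith(p)]
    if not matches:
        return False
    _, directive = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return directive == "disallow"

robots = """\
User-agent: Googlebot
Disallow: /admin/
Allow: /admin/public/

User-agent: *
Disallow: /tmp/
"""
print(is_blocked(robots, "Googlebot", "/tmp/x"))           # False: own block
print(is_blocked(robots, "Googlebot", "/admin/public/x"))  # False: Allow wins
print(is_blocked(robots, "OtherBot", "/tmp/x"))            # True: * block
```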

Quick check

If you just want to know whether a specific page is blocked, grab the robots.txt file and use a testing tool. Manually tracing rules across a 200-line file is error-prone.

Non-Standard Directives

You may encounter directives that are not part of the official standard. Most crawlers ignore them, but it is useful to know what they mean:

  • Host: example.com -- Formerly used by Yandex to specify the preferred domain (Yandex has since deprecated it)
  • Clean-param: utm_source -- Used by Yandex to indicate URL parameters that do not change page content
  • Request-rate: 1/10 -- An older alternative to Crawl-delay (rarely used)

When you see unfamiliar directives, they are almost always bot-specific extensions. They do not affect Googlebot or Bingbot behavior.


Once you can read robots.txt fluently, you can trace most crawl-access issues back to the exact rule that causes them.
