What Is robots.txt?

What robots.txt is, how it works, and why every website should have one. The complete introduction to the robots exclusion protocol.

What robots.txt actually is

A robots.txt file is a plain text file that sits at the root of your website and tells web crawlers which parts of your site they can and cannot access. When a crawler like Googlebot visits your site, it checks https://yoursite.com/robots.txt before crawling anything else.

The file follows the Robots Exclusion Protocol, a standard that has been around since 1994. Martijn Koster proposed it after web crawlers started hammering servers with requests. It remained an informal convention for decades, until Google and other search engine vendors pushed to formalize it and it was published as RFC 9309 in 2022.

Here's what a basic robots.txt file looks like:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml

That file tells all crawlers to stay out of /admin/ and /private/, except for /admin/public/, and points them to the sitemap.
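You can check rules like these programmatically with Python's standard-library parser, `urllib.robotparser`. This is a quick sketch using the example file above (the URLs are illustrative):

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# /admin/ and /private/ are off limits for every crawler...
print(rp.can_fetch("Googlebot", "https://example.com/admin/settings"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/private/notes"))   # False
# ...while everything else stays crawlable.
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))       # True
```

One caveat: Python's parser applies rules in the order they appear in the file rather than the longest-match precedence RFC 9309 describes, so the `Allow: /admin/public/` override may not behave the same way here as it does for Googlebot.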

How crawlers use robots.txt

When a well-behaved crawler arrives at your domain, it follows a specific sequence:

1. Fetch the robots.txt file. The crawler requests https://yoursite.com/robots.txt before doing anything else. If it gets a 200 response, it parses the rules. If it gets a 404, it assumes everything is allowed. (A 5xx server error is different: crawlers typically treat it as a temporary signal to hold off crawling.)

2. Find the matching User-agent block. The crawler looks for a User-agent directive that matches its name. If there's no specific match, it falls back to the wildcard User-agent: * block.

3. Apply Allow and Disallow rules. The crawler checks each URL it wants to visit against the rules in its block. If a URL matches a Disallow pattern, the crawler skips it; if both an Allow and a Disallow pattern match, the most specific (longest) rule wins, and Google breaks ties in favor of Allow.

4. Crawl the permitted pages. The crawler visits only the pages it's allowed to access, following links and indexing content as it goes.

This entire process happens automatically. You don't need to do anything beyond placing the file at your domain root.
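The sequence above can be mirrored with Python's `urllib.robotparser` (in real use, `set_url()` plus `read()` would fetch the live file over HTTP; here the rules are fed in directly):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()

# Step 1: normally rp.set_url("https://yoursite.com/robots.txt"); rp.read()
# fetches the file. Here we parse the rules directly instead.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Steps 2-3: can_fetch() matches the agent block and applies its rules.
print(rp.can_fetch("MyBot", "https://yoursite.com/private/x"))  # False
print(rp.can_fetch("MyBot", "https://yoursite.com/blog/"))      # True

# A missing robots.txt (404) means "everything allowed" -- an empty
# rule set behaves the same way.
empty = robotparser.RobotFileParser()
empty.parse([])
print(empty.can_fetch("MyBot", "https://yoursite.com/anything"))  # True
```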

The basic structure

Every robots.txt file is built from a few simple directives:

# This is a comment
User-agent: Googlebot
Disallow: /no-google/
Allow: /no-google/except-this/

User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml

User-agent

Specifies which crawler the following rules apply to. Use * for all crawlers, or a specific name like Googlebot or Bingbot.

Disallow

Tells the crawler not to access URLs matching this path. Disallow: /admin/ blocks everything under /admin/.

Allow

Overrides a Disallow for a more specific path. Useful for allowing a subdirectory inside a blocked directory.

Sitemap

Points crawlers to your XML sitemap. This directive can appear anywhere in the file and isn't tied to a specific User-agent block.

Each User-agent line starts a new block. All Disallow and Allow rules below it apply to that agent until the next User-agent line.
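This per-agent matching is easy to see in code. In the sketch below (agent names and paths are illustrative), Googlebot gets its own block while every other bot falls back to the `*` block:

```python
from urllib import robotparser

rules = [
    "User-agent: Googlebot",
    "Disallow: /no-google/",
    "",
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot matches its own block: /no-google/ is off limits, /private/ is not.
print(rp.can_fetch("Googlebot", "https://example.com/no-google/x"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))    # True
# Any other bot falls back to the * block.
print(rp.can_fetch("Bingbot", "https://example.com/private/x"))      # False
print(rp.can_fetch("Bingbot", "https://example.com/no-google/x"))    # True
```

Note that a crawler obeys only the block that matches it; rules from other blocks, including `*`, do not stack on top.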

What robots.txt can and cannot do

This is where most people get confused. Here's the reality:

What it does:

  • Tells crawlers which URLs to skip
  • Manages crawl budget allocation
  • Points crawlers to your sitemap
  • Controls which bots access your site

What it does NOT do:

  • Prevent pages from appearing in search results
  • Hide content from the public
  • Protect sensitive data or passwords
  • Guarantee pages won't be indexed

The critical thing to understand: robots.txt is advisory, not enforcement. It's a polite request, not a locked door. Well-behaved crawlers like Googlebot and Bingbot respect it. Malicious bots, scrapers, and security scanners will ignore it entirely.

robots.txt does not prevent indexing

If other sites link to a page you've blocked in robots.txt, search engines can still index the URL (they just won't crawl the content). You'll see the page in search results with a message like "No information is available for this page." To truly prevent indexing, use a noindex meta tag or X-Robots-Tag HTTP header instead.
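For an HTML page, the noindex signal is a one-line meta tag (the crawler must be able to fetch the page to see it, so don't also block the page in robots.txt):

```html
<!-- In the page's <head>: tells compliant engines not to index this page -->
<meta name="robots" content="noindex">
```

For non-HTML resources like PDFs, the equivalent is the HTTP response header `X-Robots-Tag: noindex`, set in your server or framework configuration.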

Who respects robots.txt

Major search engine crawlers follow robots.txt rules reliably:

  • Googlebot — Google's primary crawler
  • Bingbot — Microsoft Bing's crawler
  • Yandex — Russia's major search engine
  • Baiduspider — China's major search engine
  • DuckDuckBot — DuckDuckGo's crawler

Social media crawlers like facebookexternalhit and Twitterbot also generally respect robots.txt, though their behavior can vary.

AI training crawlers like GPTBot (OpenAI), Google-Extended (Google AI), and CCBot (Common Crawl) also check robots.txt, and many sites now block these specifically.
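Opting out of AI training crawls uses the same mechanism as any other bot: give each crawler its own block. A common configuration using the user-agent tokens those vendors document looks like:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```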

Why robots.txt matters for SEO

Search engines have a finite crawl budget for your site. That's the number of pages they'll crawl in a given time period. A well-configured robots.txt helps you spend that budget on pages that matter.

Without robots.txt, crawlers waste time on pages that provide no SEO value: admin panels, search result pages, duplicate content, staging areas, and API endpoints. For small sites this rarely matters. For sites with thousands of pages, it's the difference between your important pages getting indexed quickly or sitting in a queue.

A misconfigured robots.txt, on the other hand, can block your entire site from being crawled. A single Disallow: / under User-agent: * will tell every search engine to stay away from everything.

# This blocks your ENTIRE site from all crawlers
User-agent: *
Disallow: /

That's a line that has accidentally tanked more sites' search traffic than almost any other technical SEO mistake.

Creating your first robots.txt

The file must be named exactly robots.txt (lowercase) and placed at your domain root. It must be accessible at https://yoursite.com/robots.txt. Not in a subdirectory. Not with a different name.

Here's a sensible starting point for most sites:

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /tmp/

Sitemap: https://yoursite.com/sitemap.xml

This blocks admin areas, API endpoints, internal search results, and temporary files while allowing everything else to be crawled. Adjust based on what your site actually has.
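Before deploying a file like this, you can smoke-test it locally with Python's standard-library parser (the paths below are illustrative):

```python
from urllib import robotparser

STARTER = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /tmp/
"""

rp = robotparser.RobotFileParser()
rp.parse(STARTER.splitlines())

# Blocked areas really are blocked...
assert not rp.can_fetch("Googlebot", "https://yoursite.com/admin/login")
assert not rp.can_fetch("Googlebot", "https://yoursite.com/api/v1/users")
# ...and, crucially, the rest of the site is still crawlable.
assert rp.can_fetch("Googlebot", "https://yoursite.com/")
assert rp.can_fetch("Googlebot", "https://yoursite.com/blog/post")
print("robots.txt rules behave as expected")
```

A script like this in CI is a cheap guard against the accidental `Disallow: /` described above.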

Always test after editing

A syntax error or misplaced directive in robots.txt can block pages you didn't intend to block. Always validate your file after making changes — especially before deploying to production.


Your robots.txt is the first thing crawlers read. Make sure it says what you mean.
