What Does robots.txt Actually Do?

What robots.txt does and doesn't do: how crawlers use it, why it's advisory rather than enforceable, and where its limits lie.

robots.txt is a plain text file that tells web crawlers which parts of your site they can and cannot access. It lives at the root of your domain (https://example.com/robots.txt) and is the first thing a well-behaved crawler checks before it starts crawling.

That's the simple version. The reality involves important nuances about what robots.txt actually controls, what it doesn't control, and why the gap between those two matters.

How Crawlers Use robots.txt

When a search engine crawler like Googlebot visits your site, it follows a specific sequence.

1. Crawler requests /robots.txt

Before crawling any page, the crawler fetches https://example.com/robots.txt. This is hardcoded behavior -- the file must be at the root, at that exact path.

2. Crawler parses the rules

The crawler reads the file, finds the rules that apply to its User-agent, and builds a list of allowed and disallowed URL patterns.

3. Crawler respects the rules (or not)

For each URL it wants to visit, the crawler checks against the rules. If the URL is disallowed, a well-behaved crawler skips it. If it's allowed (or no rule matches), the crawler proceeds.

4. Crawler caches the file

The crawler doesn't re-fetch robots.txt for every single URL. Google caches it for roughly 24 hours, so changes to your robots.txt take effect only after the cache expires and the file is re-fetched.

A typical robots.txt looks like this:

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml

This tells all crawlers: don't access /admin/, /api/ (except /api/public/), or URLs starting with /search?. And here's where to find the sitemap.
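As a rough illustration, Python's standard-library urllib.robotparser checks URLs against these rules much the way a crawler would. One caveat: the stdlib parser applies rules in file order (first match wins), while Google applies the most specific (longest) match, so the Allow: /api/public/ exception may be evaluated differently here than by Googlebot.

```python
# Sketch: checking URLs against the example rules above with
# Python's stdlib robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/admin/users"))  # False: under /admin/
print(parser.can_fetch("*", "https://example.com/search?q=x"))   # False: matches /search?
print(parser.can_fetch("*", "https://example.com/blog/post"))    # True: no matching rule
```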

What robots.txt Controls

robots.txt controls access to URLs. Specifically, it tells crawlers which URL paths they may request from your server.

Blocking specific directories

Prevent crawlers from accessing entire sections of your site, like admin panels, internal tools, or staging areas.

Blocking specific file types

Using wildcard patterns, you can block crawlers from accessing PDFs, images, or other file types across your entire site.
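For instance, a rule like this blocks every PDF on the site. Note that the * wildcard and the $ end-of-URL anchor are extensions supported by major search engines (and standardized in RFC 9309), not part of the original 1994 protocol, so very simple crawlers may not honor them:

```text
User-agent: *
Disallow: /*.pdf$
```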

Targeting specific crawlers

Different rules for different bots. Allow Googlebot but block AI training crawlers. Give Bingbot different access than everyone else.
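As a sketch using real crawler User-agent tokens, a file like this opts out of two AI training crawlers while leaving Google's crawler unrestricted:

```text
# Opt out of AI training crawlers, leave search crawlers alone
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /
```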

Sitemap discovery

The Sitemap: directive points crawlers to your XML sitemap, helping them discover your content structure.

Managing crawl budget

On large sites, blocking low-value pages preserves crawl budget for the pages that matter for search visibility.


What robots.txt Does NOT Do

This is where most misunderstandings happen. robots.txt has hard limits, and assuming it does more than it actually does can create real problems.

It Does Not Prevent Indexing

This is the biggest misconception. Blocking a URL with Disallow prevents crawling, not indexing. If other websites link to a disallowed URL, Google knows it exists and may include it in search results -- just without a snippet or description.

You'll see entries like this in Google search results:

example.com/blocked-page/
A description for this result is not available because of this site's robots.txt.

To prevent indexing, use the noindex meta tag or the X-Robots-Tag HTTP header. And critically, the page must be crawlable for those directives to be seen.
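For example, either form keeps a page out of the index while still letting crawlers fetch it. In the page's HTML:

```html
<!-- In the page's <head>: allow crawling, block indexing -->
<meta name="robots" content="noindex">
```

Or as an HTTP response header, which also works for non-HTML files like PDFs:

```text
X-Robots-Tag: noindex
```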

It Does Not Hide Content

robots.txt is a publicly accessible file. Anyone can read it by visiting https://yoursite.com/robots.txt. If you list sensitive paths in your robots.txt, you've effectively published a directory of things you don't want people to find.

# Don't do this -- you're advertising these paths exist
User-agent: *
Disallow: /secret-admin-panel/
Disallow: /internal-financial-reports/
Disallow: /customer-database-export/

Security researchers and attackers routinely check robots.txt to discover interesting paths. Never use it as a security measure.

It Does Not Block All Bots

robots.txt is advisory, not enforceable. It's a polite request, not a locked door.

Well-behaved crawlers (Googlebot, Bingbot, legitimate AI crawlers) respect robots.txt rules. Malicious bots, scrapers, and bad actors ignore it entirely. If a bot wants to crawl your site regardless of your robots.txt, nothing in the file can stop it.

For actual access control, you need server-level solutions: authentication, IP blocking, rate limiting, or WAF rules.
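As a minimal sketch, assuming nginx and a scraper that at least identifies itself ("BadBot" is a hypothetical User-agent string):

```nginx
# nginx: refuse requests from a known-bad User-agent string.
# Determined scrapers spoof this header, so pair it with
# rate limiting or IP-level rules.
if ($http_user_agent ~* "BadBot") {
    return 403;
}
```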

It Does Not Control Link Equity

robots.txt has no effect on how PageRank or link equity flows through your site. Blocking a page from crawling doesn't prevent links pointing to that page from influencing search rankings. It just means Google can't crawl the destination.

For controlling link equity, use rel="nofollow" on individual links or the nofollow meta robots directive on entire pages.
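For example (the URL here is a placeholder):

```html
<!-- Per-link: don't pass endorsement through this one link -->
<a href="https://example.com/untrusted-page" rel="nofollow">a link</a>

<!-- Page-wide: index this page, but don't follow any of its links -->
<meta name="robots" content="nofollow">
```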

The Advisory Nature: Why It Matters

The robots.txt protocol operates on an honor system established in 1994. There's no technical enforcement. The file is just text -- it doesn't trigger any server-level blocking. A crawler that chooses to ignore it faces no technical barrier.

This matters in two practical ways.

For search engine bots: Major search engines respect robots.txt because getting delisted from search results is a powerful incentive for compliance. Google, Bing, and others have strong business reasons to honor the protocol. You can generally trust them.

For everything else: Email harvesters, content scrapers, vulnerability scanners, and other malicious crawlers have no incentive to comply. They'll ignore your robots.txt entirely. Some won't even check for it.

robots.txt is access guidance, not access control

Think of robots.txt like a "Please don't enter" sign on an unlocked door. It works for guests who respect social norms. It does nothing against intruders.

The Enforcement Gap

The gap between what robots.txt asks for and what it can enforce has widened significantly with the rise of AI crawlers.

Many site owners added Disallow rules for AI training crawlers like GPTBot and CCBot. These specific crawlers generally respect the rules. But countless other scrapers feed AI training datasets, and they don't identify themselves with recognizable User-agent strings -- or they don't check robots.txt at all.

This is an active area of legal and technical debate. robots.txt remains the primary mechanism for expressing crawling preferences, but it's not a complete solution for controlling how your content gets used.


What to Use Alongside robots.txt

For complete control over how your site interacts with crawlers and search engines, robots.txt is one tool in a larger toolkit.

Goal | Tool
Block crawling of specific paths | robots.txt
Prevent pages from appearing in search results | Meta robots noindex tag
Prevent indexing of non-HTML files | X-Robots-Tag HTTP header
Control link equity flow | rel="nofollow" or meta robots nofollow
Actually block malicious bots | WAF, rate limiting, IP blocking
Require authentication for access | Server-level auth (HTTP 401/403)
Remove a page from Google quickly | Google Search Console URL Removal
Help crawlers find your content | XML Sitemap (referenced in robots.txt)

robots.txt handles the "should this URL be crawled?" question. The other tools handle indexing, security, and access control -- concerns that robots.txt was never designed to address.

The Bottom Line

robots.txt does one thing well: it tells well-behaved crawlers which URLs they should and shouldn't request. It's fast, simple, and universally checked by legitimate bots.

It does not prevent indexing. It does not secure content. It does not block determined crawlers. Understanding these boundaries is the difference between a robots.txt that works for you and one that gives you a false sense of control.


robots.txt is a gentleman's agreement with web crawlers -- powerful when respected, invisible when ignored.
