Does robots.txt Prevent Indexing? (No, and Here's Why)

Why blocking a URL in robots.txt does not prevent it from appearing in Google search results. The difference between crawling and indexing, and what to use instead.

One of the most common misconceptions in SEO is that blocking a URL in robots.txt prevents it from being indexed by Google. It does not. If you add Disallow: /secret-page/ to your robots.txt, Google will stop crawling that page, but the URL can still appear in search results. This catches people off guard, and many site owners have found pages in the index that they thought were hidden.

This article explains why robots.txt does not prevent indexing, what happens when you block a URL, and what to use instead.

What robots.txt Actually Does

robots.txt controls crawling, not indexing. These are two different things.

Crawling is when a search engine bot visits your page and downloads its content. When you block a URL in robots.txt, you tell the bot: "Do not visit this URL. Do not download the HTML."

Indexing is when a search engine adds a URL to its search results database. A URL can be indexed even if the search engine has never crawled it.

When you block a URL with robots.txt, here is what happens:

  1. Googlebot sees the URL (through links, sitemaps, or other discovery methods)
  2. Googlebot checks robots.txt and finds the URL is disallowed
  3. Googlebot does not crawl the page (does not download the HTML)
  4. But Google still knows the URL exists
  5. If other pages link to that URL, Google may still index it

The indexed result will appear in search results as a URL with little or no snippet. Google shows something like:

https://example.com/secret-page/
No information is available for this page. Learn why

The URL is indexed. It appears in search results. Users can click on it. The only thing missing is the page description, because Google never crawled the page to get one.
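The crawl side of this can be sketched with Python's standard-library robots.txt parser. This is only an illustration: the robots.txt content and the example.com URLs are assumptions, and the parser models a generic well-behaved crawler rather than Googlebot's exact matching rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, using the example rule from this article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /secret-page/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


secret_allowed = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://example.com/secret-page/"
)
about_allowed = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://example.com/about/"
)
print("secret-page crawlable:", secret_allowed)
print("about crawlable:", about_allowed)
```

Note what the check does not tell you: it answers "may a bot fetch this URL?" and nothing else. Whether the URL ends up in the index is a separate decision that robots.txt has no say in.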

Why This Happens

Google's index is not just a list of pages it has crawled. It is a database of URLs that Google knows about from any source:

  • Links from other pages. If page A links to page B, and page A is crawled, Google discovers page B's URL even if page B is blocked by robots.txt.
  • Sitemaps. If you list a URL in your sitemap, Google knows about it.
  • External links. If another website links to your page, Google discovers the URL when it crawls that site.
  • Google Search Console. If you have ever submitted the URL or it has appeared in Search Console reports, Google knows about it.
  • Previous crawls. If Google crawled the page before you added the robots.txt block, it remembers the URL.

Google distinguishes between "knowing about a URL" and "having the content of a URL." Blocking with robots.txt prevents the second but not the first. And Google can index URLs it knows about even without having their content.
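The split between "knowing about a URL" and "having its content" can be pictured with a toy model. This is not how Google's index is implemented; the KnownUrl class and the discover, crawl, and search_result helpers are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class KnownUrl:
    url: str
    sources: Set[str] = field(default_factory=set)  # how the URL was discovered
    content: Optional[str] = None                   # None means never crawled


def discover(index: Dict[str, KnownUrl], url: str, source: str) -> None:
    """Record that a URL exists, without fetching it."""
    entry = index.setdefault(url, KnownUrl(url))
    entry.sources.add(source)


def crawl(index: Dict[str, KnownUrl], url: str, html: str,
          blocked_by_robots: bool) -> None:
    """Store page content, unless robots.txt forbids fetching the URL."""
    entry = index.setdefault(url, KnownUrl(url))
    if blocked_by_robots:
        return  # the URL stays known, but its content stays empty
    entry.content = html


def search_result(entry: KnownUrl) -> str:
    """Render a result; an uncrawled URL can still be listed, minus a snippet."""
    if entry.content is None:
        return f"{entry.url} | No information is available for this page."
    return f"{entry.url} | {entry.content[:60]}"


index: Dict[str, KnownUrl] = {}
url = "https://example.com/secret-page/"
discover(index, url, "external link")
discover(index, url, "sitemap")
crawl(index, url, "<html>secret</html>", blocked_by_robots=True)
print(search_result(index[url]))
```

In this model the robots.txt block only ever affects the content field. The entry itself, and therefore the possibility of showing it, comes from discovery.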

The Correct Way to Prevent Indexing

If you want a page completely absent from search results, you need one of these approaches:

noindex meta tag

Add this to the page's <head>:

<meta name="robots" content="noindex">

This tells Google: "You can crawl this page, but do not include it in search results." Google will fetch the page, see the noindex directive, and remove it from (or not add it to) the index.

Important: For the noindex tag to work, Googlebot must be able to crawl the page. If you block the page in robots.txt AND add a noindex tag, Googlebot will never see the noindex tag because it cannot crawl the page. The URL may still appear in search results as described above.

This is the catch-22 that confuses many people: to prevent indexing with noindex, you must allow crawling.
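To make the catch-22 concrete, here is a minimal sketch that looks for a robots meta tag in a page's HTML and then asks whether a crawler would ever get to read it. The RobotsMetaParser class and both helper functions are hypothetical names for this illustration, and the crawl_allowed flag stands in for a real robots.txt check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() not in ("robots", "googlebot"):
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """True if the page asks not to be indexed ("none" implies noindex)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & {"noindex", "none"})


def noindex_is_visible(html: str, crawl_allowed: bool) -> bool:
    """A noindex tag only counts if the crawler is allowed to fetch the page."""
    return crawl_allowed and has_noindex(html)


PAGE = '<html><head><meta name="robots" content="noindex"></head></html>'
print(noindex_is_visible(PAGE, crawl_allowed=True))
print(noindex_is_visible(PAGE, crawl_allowed=False))
```

The second call is the trap: the tag is present in the HTML, but a robots.txt block means nobody ever fetches the HTML to find it.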

X-Robots-Tag HTTP header

If you cannot add a meta tag to the HTML (for non-HTML files like PDFs, images, or API responses), use the X-Robots-Tag HTTP header:

HTTP/1.1 200 OK
X-Robots-Tag: noindex

This works the same as the meta tag but is set at the server level. It is useful for file types where you cannot embed HTML meta tags.
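As one possible way to set the header from application code, the sketch below wraps a WSGI app so that responses for PDF paths carry X-Robots-Tag: noindex. The add_x_robots_tag wrapper, the demo app, and the .pdf suffix rule are all assumptions for illustration; many sites set this header in the web server configuration instead.

```python
def add_x_robots_tag(app, suffixes=(".pdf",)):
    """Wrap a WSGI app so matching paths are served with X-Robots-Tag: noindex."""

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if path.lower().endswith(suffixes):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"placeholder body"]


app = add_x_robots_tag(demo_app)
```

To try it locally, the wrapped app can be served with the standard library's wsgiref.simple_server.make_server. Remember that the header only helps if the file is crawlable, for the same reason as the meta tag.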

Removing the page entirely

If the page should not exist at all, delete it and return a 404 or 410 status code. Google will eventually drop the URL from its index after encountering the error response a few times.

  • 404 (Not Found): Standard "page does not exist" response. Google will remove it from the index, though this can take weeks.
  • 410 (Gone): Explicitly tells Google the page has been permanently removed. Google may drop 410 URLs slightly sooner than 404s, though in practice the difference is usually small.
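A small sketch of how an application might choose between the two codes, assuming it keeps a list of deliberately removed paths. The LIVE_PAGES and REMOVED_PAGES sets and the status_for function are hypothetical.

```python
from http import HTTPStatus

# Hypothetical routing data for illustration.
LIVE_PAGES = {"/", "/about/"}
REMOVED_PAGES = {"/secret-page/"}  # deleted on purpose, never coming back


def status_for(path: str) -> HTTPStatus:
    """Pick the status code to return for a requested path."""
    if path in LIVE_PAGES:
        return HTTPStatus.OK
    if path in REMOVED_PAGES:
        return HTTPStatus.GONE       # 410: permanently removed
    return HTTPStatus.NOT_FOUND      # 404: unknown path


for path in ("/about/", "/secret-page/", "/typo/"):
    status = status_for(path)
    print(path, status.value, status.phrase)
```

Either error code only works if Googlebot can request the URL, so the path must not be disallowed in robots.txt while you wait for it to drop out of the index.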

Google's URL Removal Tool

For fast but temporary removal, use the Removals tool in Google Search Console (often called the URL removal tool). It hides the URL from search results for about 6 months. During that time, you should implement a permanent solution (noindex or page deletion). The tool is a stopgap, not a permanent fix.

When to Use robots.txt vs. noindex

| Goal | Use robots.txt | Use noindex |
|---|---|---|
| Prevent crawling (save crawl budget) | Yes | No (page is still crawled) |
| Prevent indexing (hide from search results) | No | Yes |
| Block a directory of non-public files | Yes (they will not be crawled or snippeted) | Also works, but every file needs the tag/header |
| Hide a page from search results completely | No | Yes |
| Prevent AI training on your content | Yes (with bot-specific rules) | Not applicable |

The key takeaway: robots.txt is for controlling crawling. noindex is for controlling indexing. They solve different problems.

For a comprehensive comparison, see our detailed guide on robots.txt vs. meta robots.

What About Disallow + noindex Together?

As mentioned above, this combination does not work. If you block a URL in robots.txt, Googlebot cannot crawl the page and therefore cannot see the noindex tag. The robots.txt block prevents the noindex from being discovered.

Google's John Mueller has confirmed this multiple times: you need to choose one or the other.

If your goal is preventing indexing: Remove the Disallow rule from robots.txt and add a noindex tag to the page. Allow crawling so Google can see the noindex directive.

If your goal is preventing crawling (and you accept that the URL might still appear in the index): Keep the Disallow rule. Understand that the URL may still show up in search results without a snippet.
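A quick audit for this conflict can be sketched as follows: given your robots.txt and the URLs you have marked noindex, list the ones whose noindex can never be seen. The find_hidden_noindex function and the sample data are hypothetical, and the standard-library parser only approximates how Googlebot matches rules.

```python
from urllib.robotparser import RobotFileParser
from typing import Iterable, List


def find_hidden_noindex(robots_txt: str, noindex_urls: Iterable[str],
                        user_agent: str = "Googlebot") -> List[str]:
    """Return noindex URLs that robots.txt stops the crawler from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls
            if not parser.can_fetch(user_agent, url)]


# Hypothetical site data for illustration.
ROBOTS = """\
User-agent: *
Disallow: /admin/
"""
NOINDEX_URLS = [
    "https://example.com/admin/dashboard/",
    "https://example.com/thank-you/",
]

for url in find_hidden_noindex(ROBOTS, NOINDEX_URLS):
    print("noindex cannot be seen:", url)
```

Every URL the function returns needs a decision: drop the Disallow rule so the noindex takes effect, or keep the rule and accept that the URL may be indexed without a snippet.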

Real-World Examples

The private admin page

You have an admin page at /admin/dashboard/ that you block in robots.txt. An internal link to the admin page exists in your site's footer (maybe accidentally, or through a CMS that auto-generates navigation). Google discovers the URL through that link. The URL appears in search results as a bare URL with no description.

Fix: Remove the robots.txt block. Add a noindex meta tag. Better yet, protect the page with authentication (login required) and add noindex.

The staging environment

You block your entire staging site with Disallow: /. A developer shares a staging URL in a public forum or a link from the production site accidentally points to staging. Google discovers the staging URLs. They appear in search results.

Fix: Remove the Disallow: / rule and serve a noindex meta tag or X-Robots-Tag header on every staging response, so Googlebot can actually see the directive. Better yet, protect staging with HTTP authentication (basic auth) so Google cannot access it at all.
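For the authentication route, here is a minimal sketch of HTTP basic auth as WSGI middleware. The require_basic_auth wrapper, the staging_app stand-in, and the hard-coded credentials are assumptions for illustration only; a real deployment would normally configure this in the web server or hosting platform and keep credentials out of source code.

```python
import base64
import hmac


def require_basic_auth(app, username: str, password: str, realm: str = "staging"):
    """Wrap a WSGI app so every request needs valid basic-auth credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    expected = ("Basic " + token).encode("utf-8")

    def middleware(environ, start_response):
        supplied = environ.get("HTTP_AUTHORIZATION", "").encode("utf-8")
        # Constant-time comparison avoids leaking how much of the value matched.
        if hmac.compare_digest(supplied, expected):
            return app(environ, start_response)
        start_response("401 Unauthorized", [
            ("WWW-Authenticate", f'Basic realm="{realm}"'),
            ("Content-Type", "text/plain"),
        ])
        return [b"Authentication required\n"]

    return middleware


def staging_app(environ, start_response):
    """Hypothetical stand-in for the staging site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"staging home\n"]


protected = require_basic_auth(staging_app, "dev", "example-password")
```

A crawler without credentials only ever receives the 401 response, so there is no page content for it to fetch.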

The PDF file

You block /documents/ in robots.txt because it contains internal PDFs. Someone links to one of those PDFs from an external website. Google discovers the URL and indexes it (without content).

Fix: If the PDFs should not be indexed, serve them with an X-Robots-Tag: noindex header and remove the robots.txt block on /documents/ so Googlebot can see that header. If they should not be accessible at all, move them behind authentication.

The noindex-in-robots.txt experiment

For a time, Google unofficially honored a noindex directive in robots.txt. It was never part of the official robots.txt specification, and Google stopped supporting it on September 1, 2019. Do not use Noindex: in your robots.txt; it will be ignored. Use the meta tag or X-Robots-Tag header instead.

The Historical Confusion

Part of the confusion comes from the word "block." When people say they want to "block" a page, they usually mean "prevent it from appearing in search results." But in robots.txt terminology, "block" means "prevent crawling." These are different outcomes, and the terminology mismatch causes the misunderstanding.

The other source of confusion is that robots.txt used to be more effective at preventing indexing in practice, simply because search engines were less sophisticated about indexing uncrawled URLs. In the early days of the web, if Google could not crawl a page, it usually did not know about it. Today, Google knows about URLs from many sources beyond direct crawling, making robots.txt less effective as an indexing control.

How to Check If Blocked Pages Are Indexed

Search for the URL directly in Google:

site:example.com/secret-page/

Or check the Page indexing report in Google Search Console (listed as "Pages" and formerly called "Coverage"). Look for URLs with the status "Indexed, though blocked by robots.txt." This status explicitly tells you that Google has indexed a URL that your robots.txt is blocking.

If you see this status, it confirms the problem described in this article. The solution is to remove the robots.txt block and add a noindex directive instead.

Summary

robots.txt prevents crawling, not indexing. Blocking a URL in robots.txt stops Googlebot from visiting the page, but Google may still index the URL if it discovers it through links, sitemaps, or other sources. To prevent a page from appearing in search results, use a noindex meta tag or X-Robots-Tag header. Do not combine robots.txt blocking with noindex, because the block prevents Google from seeing the noindex tag.
