robots.txt vs Meta Robots Tags: Which to Use
The difference between robots.txt and meta robots tags (noindex, nofollow). When to use each, and why using the wrong one can hurt your SEO.
People often confuse robots.txt with meta robots tags. Both control how search engines interact with your site, but they operate at completely different levels. Using the wrong one can mean pages stay indexed when you want them gone, or pages disappear from search when you want them found.
Here's the distinction that matters: robots.txt controls crawling. Meta robots tags control indexing. These are not the same thing.
What robots.txt Does
robots.txt tells crawlers which URLs they can and cannot request. It sits at the root of your domain (/robots.txt) and acts as a gatekeeper before any crawling begins.
User-agent: *
Disallow: /admin/
Disallow: /staging/
When Googlebot encounters this file, it won't fetch any URLs under /admin/ or /staging/. It never downloads those pages, never parses their content, never follows their links.
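In code terms, a compliant crawler consults these rules before every fetch. Here's a minimal sketch using Python's standard-library `urllib.robotparser` (the domain and URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
"""

rfp = RobotFileParser()
rfp.parse(rules.splitlines())

# A well-behaved crawler checks each URL before requesting it
print(rfp.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
print(rfp.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
```

Note that this check happens entirely before any request is made, which is exactly why a blocked page's contents, including any noindex tag, are never seen.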
But here's the critical part: Disallow does not mean "don't index." If other sites link to a disallowed URL, Google can still index it based on the anchor text and link context alone. You'll see the URL in search results with the note "No information is available for this page."
What Meta Robots Tags Do
The meta robots tag is an HTML element that tells search engines what to do with a page after they've crawled it.
<meta name="robots" content="noindex, nofollow">
Common directives include:
- noindex: Don't include this page in search results.
- nofollow: Don't follow any links on this page.
- none: Shorthand for noindex and nofollow combined.
- noarchive: Don't show a cached version of this page.
- nosnippet: Don't show a text snippet or video preview in search results.
The meta tag goes inside the <head> of each HTML page. The crawler must actually fetch and parse the page to see it.
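For placement, a minimal page might look like this (the title and body are placeholders):

```html
<!doctype html>
<html>
<head>
  <meta name="robots" content="noindex, nofollow">
  <title>Private page</title>
</head>
<body>...</body>
</html>
```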
The Critical Difference
This is where most people get it wrong.
If you block a page with Disallow in robots.txt, Google cannot see a noindex tag on that page. You've blocked the crawler from reading the page, so it never encounters the directive telling it not to index.
# robots.txt
User-agent: *
Disallow: /secret-page/
<!-- /secret-page/ — Googlebot never sees this -->
<meta name="robots" content="noindex">
The noindex tag is useless here. Google can't access the page to read it. And paradoxically, the page might still appear in search results if external links point to it.
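This self-defeating combination is easy to detect programmatically. Here's a sketch using only Python's standard library, with the robots.txt rules and HTML from the example above inlined as strings:

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /secret-page/
"""

PAGE_HTML = '<html><head><meta name="robots" content="noindex"></head></html>'


class MetaRobotsParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip() for d in a.get("content", "").split(",")]


# Can Googlebot crawl the page at all?
rfp = RobotFileParser()
rfp.parse(ROBOTS_TXT.splitlines())
crawlable = rfp.can_fetch("Googlebot", "https://example.com/secret-page/")

# Does the page carry a noindex directive?
parser = MetaRobotsParser()
parser.feed(PAGE_HTML)
has_noindex = "noindex" in parser.directives

if has_noindex and not crawlable:
    print("Conflict: noindex is invisible because robots.txt blocks crawling")
```

Running this against the example flags the conflict: the page asks not to be indexed, but the crawler is forbidden from ever reading that request.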
Disallow + noindex is self-defeating
If your goal is to keep a page out of Google's index, don't block it with robots.txt. You must allow crawling so Google can see and obey the noindex meta tag.
When to Use robots.txt
Use robots.txt when you want to:
- Save crawl budget. Block crawlers from wasting time on low-value pages like faceted navigation, internal search results, or print-friendly versions.
- Prevent server overload. Stop aggressive crawlers from hammering resource-intensive endpoints.
- Block specific bots. Deny access to AI training crawlers, scrapers, or specific search engines.
- Keep private areas uncrawled. Stop crawlers from accessing admin panels, staging environments, or API endpoints (though never rely on this for security).
User-agent: *
Disallow: /api/
Disallow: /search?
Disallow: /print/
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
When to Use Meta Robots Tags
Use meta robots tags when you want to:
- Prevent indexing. The noindex directive is the most reliable way to keep a page out of search results.
- Control link equity flow. Use nofollow to tell search engines not to pass ranking signals through specific pages.
- Manage cached versions. Use noarchive to prevent Google from showing a cached snapshot.
- Control snippets. Use nosnippet or max-snippet to manage how your page appears in search results.
<!-- Don't index this page, but follow its links -->
<meta name="robots" content="noindex, follow">
<!-- Target a specific bot -->
<meta name="googlebot" content="noindex">
The Third Option: X-Robots-Tag HTTP Header
There's a third mechanism most people forget about. The X-Robots-Tag HTTP header does everything the meta robots tag does, but it works for any file type -- PDFs, images, JSON responses, anything that has an HTTP response.
HTTP/1.1 200 OK
X-Robots-Tag: noindex, nofollow
Content-Type: application/pdf
This is the only way to noindex a PDF or image. You can't put a <meta> tag inside a PDF file. And you shouldn't block these with robots.txt if you want Google to actually process the noindex.
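Setting the header is a web-server concern rather than a page-level one. As one illustration, on Apache with mod_headers enabled you could apply it to every PDF (the file pattern here is just an example):

```apache
# Send X-Robots-Tag on all PDF responses
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```

Other servers have equivalents; the key point is that the directive travels in the HTTP response rather than in the document body.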
Using Both Together
The right approach often involves using both mechanisms in coordination.
Scenario: Block AI crawlers, deindex old content.
# robots.txt — block AI crawlers entirely
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
<!-- On old blog posts you want removed from Google -->
<meta name="robots" content="noindex">
Scenario: Save crawl budget on filter pages, but let Google know not to index them.
<!-- On /products?color=red&size=large pages -->
<meta name="robots" content="noindex, follow">
Don't block these with robots.txt. Let Google crawl them so it sees the noindex tag. The follow directive still lets Google discover products linked from these filter pages.
Decision Guide
Use this to pick the right mechanism for your situation.
| Goal | Use This |
|---|---|
| Block a page from being crawled | robots.txt Disallow |
| Remove a page from search results | Meta robots noindex |
| Block a specific bot (AI, scraper) | robots.txt User-agent rules |
| Stop link equity flowing through a page | Meta robots nofollow |
| Prevent indexing of a PDF or image | X-Robots-Tag HTTP header |
| Save crawl budget on low-value pages | robots.txt Disallow |
| Hide cached version of a page | Meta robots noarchive |
| Block crawling AND prevent indexing | Allow crawling + noindex tag |
That last row is counterintuitive but essential. If you want a page both uncrawled and unindexed, you need to allow crawling first so the noindex tag gets seen. Once Google processes the noindex and drops the page, you can then add the Disallow if you want to save crawl budget going forward.
Common Mistakes
Mistake 1: Using Disallow to deindex pages. This is the most common error. Disallow prevents crawling, not indexing. Pages can still appear in search results.
Mistake 2: Blocking JavaScript or CSS with robots.txt. Google needs to render your pages. Blocking render-critical resources makes Google see a broken page, which hurts your rankings.
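If your scripts and stylesheets live under an otherwise-blocked directory, an explicit Allow rule (supported by Google and standardized in RFC 9309) keeps them crawlable. The paths here are hypothetical:

```
User-agent: *
Disallow: /private/
# Keep render-critical assets crawlable even under a blocked path
Allow: /private/assets/
```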
Mistake 3: Using both Disallow and noindex simultaneously. If robots.txt blocks the page, the noindex tag is invisible to crawlers. Pick one approach: usually, allow crawling and use noindex.
Mistake 4: Forgetting about X-Robots-Tag for non-HTML content. Meta tags only work in HTML. For PDFs, images, and other files, use the X-Robots-Tag header.
robots.txt and meta robots are partners, not alternatives -- use the right tool for the right job.