robots.txt vs Meta Robots Tags: Which to Use
The difference between robots.txt and meta robots tags (noindex, nofollow). When to use each, and why using the wrong one can hurt your SEO.
People often confuse robots.txt with meta robots tags. Both control how search engines interact with your site, but they operate at completely different levels. Using the wrong one can mean pages stay indexed when you want them gone, or pages disappear from search when you want them found.
Here's the distinction that matters: robots.txt controls crawling. Meta robots tags control indexing. These are not the same thing.
What robots.txt Does
robots.txt tells crawlers which URLs they can and cannot request. It sits at the root of your domain (/robots.txt) and acts as a gatekeeper before any crawling begins.
User-agent: *
Disallow: /admin/
Disallow: /staging/
When Googlebot encounters this file, it won't fetch any URLs under /admin/ or /staging/. It never downloads those pages, never parses their content, never follows their links.
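In code terms, a compliant crawler consults these rules before every fetch. Here's a minimal sketch using Python's standard-library `urllib.robotparser` (the domain and URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
"""

rfp = RobotFileParser()
rfp.parse(rules.splitlines())

# A well-behaved crawler checks each URL before requesting it
print(rfp.can_fetch("Googlebot", "https://example.com/admin/login"))  # False
print(rfp.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
```

Note that this check happens entirely before any request is made, which is exactly why a blocked page's contents, including any noindex tag, are never seen.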
But here's the critical part: Disallow does not mean "don't index." If other sites link to a disallowed URL, Google can still index it based on the anchor text and link context alone. You'll see the URL in search results with the note "No information is available for this page."
What Meta Robots Tags Do
The meta robots tag is an HTML element that tells search engines what to do with a page after they've crawled it.
<meta name="robots" content="noindex, nofollow">
Common directives include:
- noindex: Don't include this page in search results.
- nofollow: Don't follow any links on this page.
- none: Shorthand for noindex and nofollow combined.
- noarchive: Don't show a cached version of this page.
- nosnippet: Don't show a text snippet or video preview in search results.
The meta tag goes inside the <head> of each HTML page. The crawler must actually fetch and parse the page to see it.
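For placement, a minimal page might look like this (the title and body are placeholders):

```html
<!doctype html>
<html>
<head>
  <meta name="robots" content="noindex, nofollow">
  <title>Private page</title>
</head>
<body>...</body>
</html>
```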
The Critical Difference
This is where most people get it wrong.
If you block a page with Disallow in robots.txt, Google cannot see a noindex tag on that page. You've blocked the crawler from reading the page, so it never encounters the directive telling it not to index.
# robots.txt
User-agent: *
Disallow: /secret-page/
<!-- /secret-page/ — Googlebot never sees this -->
<meta name="robots" content="noindex">
The noindex tag is useless here. Google can't access the page to read it. And paradoxically, the page might still appear in search results if external links point to it.
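This self-defeating combination is easy to detect programmatically. Here's a sketch using only Python's standard library, with the robots.txt rules and HTML from the example above inlined as strings:

```python
from urllib.robotparser import RobotFileParser
from html.parser import HTMLParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /secret-page/
"""

PAGE_HTML = '<html><head><meta name="robots" content="noindex"></head></html>'


class MetaRobotsParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip() for d in a.get("content", "").split(",")]


# Can Googlebot crawl the page at all?
rfp = RobotFileParser()
rfp.parse(ROBOTS_TXT.splitlines())
crawlable = rfp.can_fetch("Googlebot", "https://example.com/secret-page/")

# Does the page carry a noindex directive?
parser = MetaRobotsParser()
parser.feed(PAGE_HTML)
has_noindex = "noindex" in parser.directives

if has_noindex and not crawlable:
    print("Conflict: noindex is invisible because robots.txt blocks crawling")
```

Running this against the example flags the conflict: the page asks not to be indexed, but the crawler is forbidden from ever reading that request.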
Disallow + noindex is self-defeating
If your goal is to keep a page out of Google's index, don't block it with robots.txt. You must allow crawling so Google can see and obey the noindex meta tag.
When to Use robots.txt
Use robots.txt when you want to:
- Save crawl budget. Block crawlers from wasting time on low-value pages like faceted navigation, internal search results, or print-friendly versions.
- Prevent server overload. Stop aggressive crawlers from hammering resource-intensive endpoints.
- Block specific bots. Deny access to AI training crawlers, scrapers, or specific search engines.
- Keep private areas uncrawled. Stop crawlers from accessing admin panels, staging environments, or API endpoints (though never rely on this for security).
User-agent: *
Disallow: /api/
Disallow: /search?
Disallow: /print/
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
When to Use Meta Robots Tags
Use meta robots tags when you want to:
- Prevent indexing. The noindex directive is the most reliable way to keep a page out of search results.
- Control link equity flow. Use nofollow to tell search engines not to pass ranking signals through specific pages.
- Manage cached versions. Use noarchive to prevent Google from showing a cached snapshot.
- Control snippets. Use nosnippet or max-snippet to manage how your page appears in search results.
<!-- Don't index this page, but follow its links -->
<meta name="robots" content="noindex, follow">
<!-- Target a specific bot -->
<meta name="googlebot" content="noindex">
The Third Option: X-Robots-Tag HTTP Header
There's a third mechanism most people forget about. The X-Robots-Tag HTTP header does everything the meta robots tag does, but it works for any file type -- PDFs, images, JSON responses, anything that has an HTTP response.
HTTP/1.1 200 OK
X-Robots-Tag: noindex, nofollow
Content-Type: application/pdf
This is the only way to noindex a PDF or image. You can't put a <meta> tag inside a PDF file. And you shouldn't block these with robots.txt if you want Google to actually process the noindex.
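Setting the header is a web-server concern rather than a page-level one. As one illustration, on Apache with mod_headers enabled you could apply it to every PDF (the file pattern here is just an example):

```apache
# Send X-Robots-Tag on all PDF responses
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```

Other servers have equivalents; the key point is that the directive travels in the HTTP response rather than in the document body.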
Using Both Together
The right approach often involves using both mechanisms in coordination.
Scenario: Block AI crawlers, deindex old content.
# robots.txt — block AI crawlers entirely
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
<!-- On old blog posts you want removed from Google -->
<meta name="robots" content="noindex">
Scenario: Save crawl budget on filter pages, but let Google know not to index them.
<!-- On /products?color=red&size=large pages -->
<meta name="robots" content="noindex, follow">
Don't block these with robots.txt. Let Google crawl them so it sees the noindex tag. The follow directive still lets Google discover products linked from these filter pages.
Decision Guide
Use this to pick the right mechanism for your situation.
| Goal | Use This |
|---|---|
| Block a page from being crawled | robots.txt Disallow |
| Remove a page from search results | Meta robots noindex |
| Block a specific bot (AI, scraper) | robots.txt User-agent rules |
| Stop link equity flowing through a page | Meta robots nofollow |
| Prevent indexing of a PDF or image | X-Robots-Tag HTTP header |
| Save crawl budget on low-value pages | robots.txt Disallow |
| Hide cached version of a page | Meta robots noarchive |
| Block crawling AND prevent indexing | Allow crawling + noindex tag |
That last row is counterintuitive but essential. If you want a page both uncrawled and unindexed, you need to allow crawling first so the noindex tag gets seen. Once Google processes the noindex and drops the page, you can then add the Disallow if you want to save crawl budget going forward.
Common Mistakes
Mistake 1: Using Disallow to deindex pages. This is the most common error. Disallow prevents crawling, not indexing. Pages can still appear in search results.
Mistake 2: Blocking JavaScript or CSS with robots.txt. Google needs to render your pages. Blocking render-critical resources makes Google see a broken page, which hurts your rankings.
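If your scripts and stylesheets live under an otherwise-blocked directory, an explicit Allow rule (supported by Google and standardized in RFC 9309) keeps them crawlable. The paths here are hypothetical:

```
User-agent: *
Disallow: /private/
# Keep render-critical assets crawlable even under a blocked path
Allow: /private/assets/
```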
Mistake 3: Using both Disallow and noindex simultaneously. If robots.txt blocks the page, the noindex tag is invisible to crawlers. Pick one approach: usually, allow crawling and use noindex.
Mistake 4: Forgetting about X-Robots-Tag for non-HTML content. Meta tags only work in HTML. For PDFs, images, and other files, use the X-Robots-Tag header.
robots.txt and meta robots are partners, not alternatives -- use the right tool for the right job.