robots.txt Best Practices

Robots.txt best practices for SEO and crawl management. What to block, what to allow, and the mistakes that hurt your site.

A robots.txt file is deceptively simple. A few lines of text can either optimize your crawl budget or tank your organic traffic overnight. These best practices will keep you on the right side of that line.

Keep It Simple

The best robots.txt files are short and readable. If your file is 200 lines long, something has gone wrong. You're either micro-managing crawlers at a URL level (use meta robots tags instead) or you've accumulated rules over years without cleaning up.

A good robots.txt for most sites is under 20 lines. Block the directories that should never be crawled. Allow everything else. Done.

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /tmp/

Sitemap: https://example.com/sitemap.xml

That covers 90% of sites. Add complexity only when you have a specific reason.
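A file this small can be sanity-checked before it ever reaches production. A minimal sketch using Python's standard-library parser, with hypothetical example URLs:

```python
# Sanity-check the minimal robots.txt above with Python's stdlib parser.
# The checked URLs are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Blocked directories stay blocked...
admin_blocked = not parser.can_fetch("*", "https://example.com/admin/users")
# ...and everything else stays crawlable.
product_allowed = parser.can_fetch("*", "https://example.com/products/widget")
```

Note that `urllib.robotparser` does simple prefix matching, so it is a reasonable smoke test for plain path rules, though not a perfect model of every search engine's matcher.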

Always Include a Sitemap Directive

Every robots.txt file should reference your XML sitemap. This is the most reliable way to ensure crawlers discover all your important pages, especially deep pages that may not be reachable through internal links alone.

Sitemap: https://example.com/sitemap.xml

The Sitemap directive is independent of any User-agent block, so it applies to all crawlers. Use the full absolute URL including the protocol. If you have multiple sitemaps, list them all.

Sitemap location

Your sitemap does not need to be in the same directory as your robots.txt. The URL just needs to be absolute. Sitemap: https://example.com/sitemaps/main.xml works perfectly.
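If you automate robots.txt audits, the declared sitemaps can be read back programmatically. A sketch using the stdlib parser's `site_maps()` method (available since Python 3.8), with this section's example file:

```python
# List every Sitemap directive declared in a robots.txt file.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemaps/main.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Returns the declared sitemap URLs, or None if the file has no Sitemap lines.
sitemaps = parser.site_maps()
```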

Do Not Block CSS, JavaScript, or Images

This is one of the most common mistakes. Blocking /css/, /js/, or /images/ in robots.txt prevents Google from rendering your pages correctly. Googlebot needs access to these resources to understand your page layout, content, and user experience.

If Google cannot render your page, it may not index it properly -- or at all.

# DO NOT do this
User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /images/
Disallow: /fonts/

If you have specific assets you need to block (like internal admin stylesheets), block them individually rather than blocking entire asset directories.
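A quick check that asset URLs remain crawlable can be part of the same audit. A sketch with hypothetical asset paths:

```python
# Verify that common asset paths are not caught by any Disallow rule.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical asset URLs that rendering depends on.
assets = [
    "https://example.com/css/main.css",
    "https://example.com/js/app.js",
    "https://example.com/images/logo.png",
]
assets_crawlable = all(parser.can_fetch("Googlebot", url) for url in assets)
```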


Do Not Use robots.txt to Hide Content

Robots.txt blocks crawling, not indexing. If another site links to a page you have blocked in robots.txt, Google can still index the URL. It will appear in search results with a message like "No information is available for this page" -- which is arguably worse than just letting it be indexed normally.

For pages that should not appear in search results, use the noindex meta tag or the X-Robots-Tag HTTP header. For pages that should not be publicly accessible at all, use authentication.

<!-- Use this to prevent indexing -->
<meta name="robots" content="noindex">

Robots.txt is for managing crawl budget, not for hiding pages.
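For non-HTML resources like PDFs, there is no HTML to put a meta tag in; the X-Robots-Tag response header does the same job. A small illustrative helper (the function name and sample headers are assumptions, not a real library API):

```python
# Illustrative helper: does a response's X-Robots-Tag header carry noindex?
def is_noindexed(headers: dict) -> bool:
    """True if the X-Robots-Tag header includes a noindex directive."""
    tag = headers.get("X-Robots-Tag", "")
    directives = [d.strip().lower() for d in tag.split(",")]
    return "noindex" in directives

# A PDF kept out of search results via the header, vs. a normal HTML page.
pdf_headers = {"Content-Type": "application/pdf", "X-Robots-Tag": "noindex, nofollow"}
html_headers = {"Content-Type": "text/html"}
```

Note that X-Robots-Tag also supports user-agent-scoped values (e.g. `googlebot: noindex`), which this simple check does not parse.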

Be Specific with Disallow Paths

Vague rules cause collateral damage. Disallow: /a blocks /about, /api/, /articles/, and every other path starting with /a. Be specific.

Do this                         Not this
Disallow: /admin/               Disallow: /a
Disallow: /search?              Disallow: /s
Disallow: /internal/api/        Disallow: /internal
Disallow: /tmp/                 Disallow: /t
Disallow: /account/settings/    Disallow: /account

Always check what else a path prefix might match before adding a Disallow rule. One character difference can block or unblock hundreds of pages.
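The prefix-matching behavior is easy to demonstrate with the stdlib parser; the /about URL below is a hypothetical example:

```python
# Show the collateral damage of a one-character Disallow prefix.
from urllib.robotparser import RobotFileParser

vague = RobotFileParser()
vague.parse(["User-agent: *", "Disallow: /a"])

specific = RobotFileParser()
specific.parse(["User-agent: *", "Disallow: /admin/"])

# The vague rule takes /about down with it; the specific rule does not.
about_survives_vague = vague.can_fetch("*", "https://example.com/about")
about_survives_specific = specific.can_fetch("*", "https://example.com/about")
```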

Test Before Deploying

Never push a robots.txt change to production without testing it first. A single typo or misplaced rule can deindex your entire site. Google can begin removing pages from search results within hours of encountering a new Disallow: /.

Before deploying, validate that:

  • The syntax is correct (no typos in directives)
  • Important pages are not accidentally blocked
  • The rules match only what you intend
  • Wildcard patterns are not overly broad

Use a robots.txt testing tool to verify each rule against specific URLs. The few minutes of testing can save you weeks of recovery.
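The URL checks above can run as an automated pre-deploy gate. A minimal sketch, where the candidate rules and the must-crawl URL list are illustrative:

```python
# Pre-deploy gate: fail if any must-crawl URL would be blocked by the
# candidate robots.txt. Rules and URLs here are illustrative.
from urllib.robotparser import RobotFileParser

candidate = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""

MUST_CRAWL = [
    "https://example.com/",
    "https://example.com/products/",
    "https://example.com/blog/latest-post",
]

parser = RobotFileParser()
parser.parse(candidate.splitlines())

# Any entry in violations means the deploy should be stopped.
violations = [url for url in MUST_CRAWL if not parser.can_fetch("*", url)]
```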

Use Meta Robots for Noindex, Not Disallow

When people want a page to stop appearing in search results, they often reach for robots.txt. This is the wrong tool.

Disallow prevents crawling. But if Google already knows the URL from external links, it can still appear in search results -- just without any content snippet. The correct approach is to allow crawling but add a noindex directive via meta tag or HTTP header.

# Wrong approach: block crawling
User-agent: *
Disallow: /old-page

# Right approach: allow crawling, prevent indexing
# Add to the page's HTML:
# <meta name="robots" content="noindex">

If you block a page with robots.txt AND add a noindex tag, the crawler cannot reach the page to see the noindex tag, which means the noindex will never take effect.

Manage Crawl Budget for Large Sites

For sites with fewer than 10,000 pages, crawl budget rarely matters. Google will crawl everything eventually.

For large sites -- e-commerce stores with millions of product pages, news sites with decades of archives, or large-scale platforms -- crawl budget is critical. Crawlers have a finite number of pages they will crawl per visit.

Block low-value pages to direct crawlers toward pages that matter:

User-agent: *
Disallow: /search?
Disallow: /*?sort=*
Disallow: /*?filter=*
Disallow: /*?page=*
Disallow: /tag/*
Disallow: /author/*/page/*
Disallow: /category/*/page/*

Sitemap: https://example.com/sitemap.xml

This prevents crawlers from wasting time on paginated archives, filtered views, and internal search results -- pages that offer minimal unique content.
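Testing wildcard rules like these takes a little more care: Python's `urllib.robotparser` does plain prefix matching and does not understand the `*` and trailing `$` wildcards that Google supports. A rough translation to a regex prefix match, as a sketch only (real crawler matchers have additional subtleties):

```python
# Translate a robots.txt path pattern (with * wildcards and an optional
# trailing $ anchor) into a regex for testing. Rough approximation only.
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape regex metacharacters, then restore * as "any run of characters".
    regex = re.escape(body).replace(r"\*", ".*")
    # Rules match from the start of the path; $ anchors the end as well.
    return re.compile("^" + regex + ("$" if anchored else ""))

sort_rule = robots_pattern_to_regex("/*?sort=")
pdf_rule = robots_pattern_to_regex("/*.pdf$")
```

For example, `sort_rule` matches a hypothetical /shoes?sort=price but not /shoes, so the filtered view is blocked while the category page stays crawlable.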


Review After Site Migrations

Site migrations are the most dangerous time for your robots.txt. URL structures change, new directories appear, old directories disappear, and whoever handles the migration may not update robots.txt to match.

After every migration, verify:

  • Existing Disallow rules still target the correct paths
  • New important sections are not accidentally blocked
  • Old rules blocking deprecated paths are cleaned up
  • The Sitemap directive points to the correct URL

A robots.txt that was perfect for your old site structure can be actively harmful for your new one.

Consider AI Crawlers

AI training crawlers are a relatively new concern. If you do not want your content used for AI model training, you should explicitly block the known AI crawler user agents. If you do nothing, most AI crawlers will treat your content as fair game.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Decide your position on AI crawling and implement it deliberately. New AI crawlers appear regularly, so revisit this list periodically.
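Per-agent groups like these can be verified with the stdlib parser too; the article URL below is a hypothetical example:

```python
# Confirm an AI-crawler block applies to that agent only, leaving
# ordinary search crawlers unaffected.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

gptbot_allowed = parser.can_fetch("GPTBot", "https://example.com/article")
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/article")
```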

The Do / Don't Summary

Do                                        Don't
Include a Sitemap directive               Use robots.txt to prevent indexing
Be specific with Disallow paths           Block CSS, JS, or image files
Test before deploying                     Deploy without validating syntax
Use noindex for hiding from search        Assume Disallow = noindex
Review after site migrations              Set and forget your robots.txt
Block low-value pages on large sites      Micro-manage crawling per-URL
Keep it simple and readable               Let it grow to 200+ lines

Good robots.txt practices protect your SEO. Bad ones destroy it.
