robots.txt Best Practices
Robots.txt best practices for SEO and crawl management. What to block, what to allow, and the mistakes that hurt your site.
A robots.txt file is deceptively simple. A few lines of text can either optimize your crawl budget or tank your organic traffic overnight. These best practices will keep you on the right side of that line.
Keep It Simple
The best robots.txt files are short and readable. If your file is 200 lines long, something has gone wrong. You're either micro-managing crawlers at a URL level (use meta robots tags instead) or you've accumulated rules over years without cleaning up.
A good robots.txt for most sites is under 20 lines. Block the directories that should never be crawled. Allow everything else. Done.
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
That covers 90% of sites. Add complexity only when you have a specific reason.
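You can sanity-check a file like this offline with Python's standard-library robotparser. Note that it implements the original robots.txt spec, so it handles plain path prefixes like these but not Google-style wildcards; the URLs below are illustrative:

```python
from urllib import robotparser

# The minimal robots.txt from above (query-string rules omitted,
# since robotparser only does plain prefix matching)
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /tmp/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Blocked directory: crawling is disallowed
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
# Everything else stays crawlable by default
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True
```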
Always Include a Sitemap Directive
Every robots.txt file should reference your XML sitemap. This is the most reliable way to ensure crawlers discover all your important pages, especially deep pages that may not be reachable through internal links alone.
Sitemap: https://example.com/sitemap.xml
The Sitemap directive is independent of any User-agent block, so it applies to all crawlers. Use the full absolute URL including the protocol. If you have multiple sitemaps, list them all.
Sitemap location
Your sitemap does not need to be in the same directory as your robots.txt. The URL just needs to be absolute. Sitemap: https://example.com/sitemaps/main.xml works perfectly.
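For example, a site with split sitemaps (file names here are illustrative) lists each one on its own line:

```
Sitemap: https://example.com/sitemaps/pages.xml
Sitemap: https://example.com/sitemaps/posts.xml
Sitemap: https://example.com/sitemaps/products.xml
```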
Do Not Block CSS, JavaScript, or Images
This is one of the most common mistakes. Blocking /css/, /js/, or /images/ in robots.txt prevents Google from rendering your pages correctly. Googlebot needs access to these resources to understand your page layout, content, and user experience.
If Google cannot render your page, it may not index it properly -- or at all.
# DO NOT do this
User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /images/
Disallow: /fonts/
If you have specific assets you need to block (like internal admin stylesheets), block them individually rather than blocking entire asset directories.
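For instance, a targeted rule like this (the file name is illustrative) blocks one internal stylesheet while leaving the rest of the directory crawlable:

```
User-agent: *
Disallow: /css/admin-internal.css
```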
Do Not Use robots.txt to Hide Content
Robots.txt blocks crawling, not indexing. If another site links to a page you have blocked in robots.txt, Google can still index the URL. It will appear in search results with a message like "No information is available for this page" -- which is arguably worse than just letting it be indexed normally.
For pages that should not appear in search results, use the noindex meta tag or the X-Robots-Tag HTTP header. For pages that should not be publicly accessible at all, use authentication.
<!-- Use this to prevent indexing -->
<meta name="robots" content="noindex">
Robots.txt is for managing crawl budget, not for hiding pages.
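For non-HTML resources such as PDFs, where you cannot add a meta tag, the same directive can be sent as an HTTP header. A sketch for nginx (adjust the location pattern to your setup):

```nginx
# Serve PDFs with a noindex header: crawlable, but kept out of search results
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}
```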
Be Specific with Disallow Paths
Vague rules cause unintended collateral damage. Disallow: /a blocks /about, /api/, /articles/, and every other path starting with /a. Be specific.
| Do this | Not this |
|---|---|
| Disallow: /admin/ | Disallow: /a |
| Disallow: /search? | Disallow: /s |
| Disallow: /internal/api/ | Disallow: /internal |
| Disallow: /tmp/ | Disallow: /t |
| Disallow: /account/settings/ | Disallow: /account |
Always check what else a path prefix might match before adding a Disallow rule. One character difference can block or unblock hundreds of pages.
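The prefix behavior is easy to verify with Python's urllib.robotparser, which does plain prefix matching per the original spec (URLs are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /a"])

# One vague rule blocks far more than /a itself:
for path in ("/about", "/api/v1", "/articles/seo", "/blog"):
    print(path, rp.can_fetch("*", f"https://example.com{path}"))
# /about, /api/v1, and /articles/seo are all blocked; only /blog survives
```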
Test Before Deploying
Never push a robots.txt change to production without testing it first. A single typo or misplaced rule can deindex your entire site. Google can begin removing pages from search results within hours of encountering a new Disallow: /.
Before deploying, validate that:
- The syntax is correct (no typos in directives)
- Important pages are not accidentally blocked
- The rules match only what you intend
- Wildcard patterns are not overly broad
Use a robots.txt testing tool to verify each rule against specific URLs. The few minutes of testing can save you weeks of recovery.
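One way to script that check is with a small deploy gate. This is a sketch using Python's standard-library robotparser (plain prefixes only, no Google-style wildcards); check_robots and the URLs are hypothetical names:

```python
from urllib import robotparser

def check_robots(robots_lines, must_allow, must_block, agent="*"):
    """Return a list of rule violations; an empty list means safe to deploy."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    problems = []
    for url in must_allow:
        if not rp.can_fetch(agent, url):
            problems.append(f"blocked but should be crawlable: {url}")
    for url in must_block:
        if rp.can_fetch(agent, url):
            problems.append(f"crawlable but should be blocked: {url}")
    return problems

problems = check_robots(
    ["User-agent: *", "Disallow: /admin/", "Disallow: /tmp/"],
    must_allow=[
        "https://example.com/",
        "https://example.com/products/widget",
        "https://example.com/css/main.css",  # rendering resources must stay open
    ],
    must_block=["https://example.com/admin/login"],
)
print(problems)  # an empty list: no violations
```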
Use Meta Robots for Noindex, Not Disallow
When people want a page to stop appearing in search results, they often reach for robots.txt. This is the wrong tool.
Disallow prevents crawling. But if Google already knows the URL from external links, it can still appear in search results -- just without any content snippet. The correct approach is to allow crawling but add a noindex directive via meta tag or HTTP header.
# Wrong approach: block crawling
User-agent: *
Disallow: /old-page
# Right approach: allow crawling, prevent indexing
# Add to the page's HTML:
# <meta name="robots" content="noindex">
If you block a page with robots.txt AND add a noindex tag, the crawler cannot reach the page to see the noindex tag, which means the noindex will never take effect.
Manage Crawl Budget for Large Sites
For sites with fewer than 10,000 pages, crawl budget rarely matters. Google will crawl everything eventually.
For large sites -- e-commerce stores with millions of product pages, news sites with decades of archives, or large-scale platforms -- crawl budget is critical. Crawlers have a finite number of pages they will crawl per visit.
Block low-value pages to direct crawlers toward pages that matter:
User-agent: *
Disallow: /search?
Disallow: /*?sort=*
Disallow: /*?filter=*
Disallow: /*?page=*
Disallow: /tag/*
Disallow: /author/*/page/*
Disallow: /category/*/page/*
Sitemap: https://example.com/sitemap.xml
This prevents crawlers from wasting time on paginated archives, filtered views, and internal search results -- pages that offer minimal unique content.
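In these rules, * matches any run of characters and $ anchors a pattern to the end of the URL. A rough sketch of that matching logic in Python (rule_matches is a hypothetical helper; real crawlers also apply longest-match precedence between Allow and Disallow):

```python
import re

def rule_matches(pattern, url_path):
    """Check a robots.txt path pattern (with * and $) against a URL path."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"      # * matches any sequence of characters
        elif ch == "$":
            regex += "$"       # $ anchors the match at the end of the URL
        else:
            regex += re.escape(ch)
    return re.match(regex, url_path) is not None  # rules match from the start

print(rule_matches("/*?sort=*", "/products?sort=price"))  # True: blocked
print(rule_matches("/*?sort=*", "/products"))             # False: crawlable
print(rule_matches("/tag/*", "/tag/news"))                # True: blocked
```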
Review After Site Migrations
Site migrations are the most dangerous time for your robots.txt. URL structures change, new directories appear, old directories disappear, and whoever handles the migration may not update robots.txt to match.
After every migration, verify:
- Existing Disallow rules still target the correct paths
- New important sections are not accidentally blocked
- Old rules blocking deprecated paths are cleaned up
- The Sitemap directive points to the correct URL
A robots.txt that was perfect for your old site structure can be actively harmful for your new one.
Consider AI Crawlers
AI training crawlers are a relatively new concern. If you do not want your content used for AI model training, you should explicitly block the known AI crawler user agents. If you do nothing, most AI crawlers will treat your content as fair game.
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Decide your position on AI crawling and implement it deliberately. New AI crawlers appear regularly, so revisit this list periodically.
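Since the list changes over time, it can help to generate these stanzas from a single list you maintain (a small sketch; the crawler names are the ones from the example above):

```python
# User agents to block from AI training -- revisit this list periodically
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

def ai_block_stanzas(agents):
    """Render a User-agent / Disallow stanza for each crawler in the list."""
    return "\n\n".join(f"User-agent: {a}\nDisallow: /" for a in agents)

print(ai_block_stanzas(AI_CRAWLERS))
```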
The Do / Don't Summary
| Do | Don't |
|---|---|
| Include a Sitemap directive | Use robots.txt to prevent indexing |
| Be specific with Disallow paths | Block CSS, JS, or image files |
| Test before deploying | Deploy without validating syntax |
| Use noindex for hiding from search | Assume Disallow = noindex |
| Review after site migrations | Set and forget your robots.txt |
| Block low-value pages on large sites | Micro-manage crawling per-URL |
| Keep it simple and readable | Let it grow to 200+ lines |
Good robots.txt practices protect your SEO. Bad ones destroy it.