robots.txt Best Practices
Robots.txt best practices for SEO and crawl management. What to block, what to allow, and the mistakes that hurt your site.
A robots.txt file is deceptively simple. A few lines of text can either optimize your crawl budget or tank your organic traffic overnight. These best practices will keep you on the right side of that line.
Keep It Simple
The best robots.txt files are short and readable. If your file is 200 lines long, something has gone wrong. You're either micro-managing crawlers at a URL level (use meta robots tags instead) or you've accumulated rules over years without cleaning up.
A good robots.txt for most sites is under 20 lines. Block the directories that should never be crawled. Allow everything else. Done.
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /tmp/
Sitemap: https://example.com/sitemap.xml
That covers 90% of sites. Add complexity only when you have a specific reason.
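You can sanity-check a file like this offline with Python's standard-library robotparser. Note that it implements the original robots.txt spec, so it handles plain path prefixes like these but not Google-style wildcards; the URLs below are illustrative:

```python
from urllib import robotparser

# The minimal robots.txt from above (query-string rules omitted,
# since robotparser only does plain prefix matching)
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /tmp/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Blocked directory: crawling is disallowed
print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
# Everything else stays crawlable by default
print(rp.can_fetch("*", "https://example.com/blog/post"))    # True
```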
Always Include a Sitemap Directive
Every robots.txt file should reference your XML sitemap. This is the most reliable way to ensure crawlers discover all your important pages, especially deep pages that may not be reachable through internal links alone.
Sitemap: https://example.com/sitemap.xml
The Sitemap directive is independent of any User-agent block, so it applies to all crawlers. Use the full absolute URL including the protocol. If you have multiple sitemaps, list them all.
Sitemap location
Your sitemap does not need to be in the same directory as your robots.txt. The URL just needs to be absolute. Sitemap: https://example.com/sitemaps/main.xml works perfectly.
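For example, a site with split sitemaps (file names here are illustrative) lists each one on its own line:

```
Sitemap: https://example.com/sitemaps/pages.xml
Sitemap: https://example.com/sitemaps/posts.xml
Sitemap: https://example.com/sitemaps/products.xml
```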
Do Not Block CSS, JavaScript, or Images
This is one of the most common mistakes. Blocking /css/, /js/, or /images/ in robots.txt prevents Google from rendering your pages correctly. Googlebot needs access to these resources to understand your page layout, content, and user experience.
If Google cannot render your page, it may not index it properly -- or at all.
# DO NOT do this
User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /images/
Disallow: /fonts/
If you have specific assets you need to block (like internal admin stylesheets), block them individually rather than blocking entire asset directories.
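For instance, a targeted rule like this (the file name is illustrative) blocks one internal stylesheet while leaving the rest of the directory crawlable:

```
User-agent: *
Disallow: /css/admin-internal.css
```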
Do Not Use robots.txt to Hide Content
Robots.txt blocks crawling, not indexing. If another site links to a page you have blocked in robots.txt, Google can still index the URL. It will appear in search results with a message like "No information is available for this page" -- which is arguably worse than just letting it be indexed normally.
For pages that should not appear in search results, use the noindex meta tag or the X-Robots-Tag HTTP header. For pages that should not be publicly accessible at all, use authentication.
<!-- Use this to prevent indexing -->
<meta name="robots" content="noindex">
Robots.txt is for managing crawl budget, not for hiding pages.
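For non-HTML resources such as PDFs, where you cannot add a meta tag, the same directive can be sent as an HTTP header. A sketch for nginx (adjust the location pattern to your setup):

```nginx
# Serve PDFs with a noindex header: crawlable, but kept out of search results
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}
```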
Be Specific with Disallow Paths
Vague rules cause unintended collateral damage. Disallow: /a blocks /about, /api/, /articles/, and every other path starting with /a. Be specific.
| Do this | Not this |
|---|---|
| Disallow: /admin/ | Disallow: /a |
| Disallow: /search? | Disallow: /s |
| Disallow: /internal/api/ | Disallow: /internal |
| Disallow: /tmp/ | Disallow: /t |
| Disallow: /account/settings/ | Disallow: /account |
Always check what else a path prefix might match before adding a Disallow rule. One character difference can block or unblock hundreds of pages.
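The prefix behavior is easy to verify with Python's urllib.robotparser, which does plain prefix matching per the original spec (URLs are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /a"])

# One vague rule blocks far more than /a itself:
for path in ("/about", "/api/v1", "/articles/seo", "/blog"):
    print(path, rp.can_fetch("*", f"https://example.com{path}"))
# /about, /api/v1, and /articles/seo are all blocked; only /blog survives
```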
Test Before Deploying
Never push a robots.txt change to production without testing it first. A single typo or misplaced rule can deindex your entire site. Google can begin removing pages from search results within hours of encountering a new Disallow: /.
Before deploying, validate that:
- The syntax is correct (no typos in directives)
- Important pages are not accidentally blocked
- The rules match only what you intend
- Wildcard patterns are not overly broad
Use a robots.txt testing tool to verify each rule against specific URLs. The few minutes of testing can save you weeks of recovery.
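One way to script that check is with a small deploy gate. This is a sketch using Python's standard-library robotparser (plain prefixes only, no Google-style wildcards); check_robots and the URLs are hypothetical names:

```python
from urllib import robotparser

def check_robots(robots_lines, must_allow, must_block, agent="*"):
    """Return a list of rule violations; an empty list means safe to deploy."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    problems = []
    for url in must_allow:
        if not rp.can_fetch(agent, url):
            problems.append(f"blocked but should be crawlable: {url}")
    for url in must_block:
        if rp.can_fetch(agent, url):
            problems.append(f"crawlable but should be blocked: {url}")
    return problems

problems = check_robots(
    ["User-agent: *", "Disallow: /admin/", "Disallow: /tmp/"],
    must_allow=[
        "https://example.com/",
        "https://example.com/products/widget",
        "https://example.com/css/main.css",  # rendering resources must stay open
    ],
    must_block=["https://example.com/admin/login"],
)
print(problems)  # an empty list: no violations
```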
Use Meta Robots for Noindex, Not Disallow
When people want a page to stop appearing in search results, they often reach for robots.txt. This is the wrong tool.
Disallow prevents crawling. But if Google already knows the URL from external links, it can still appear in search results -- just without any content snippet. The correct approach is to allow crawling but add a noindex directive via meta tag or HTTP header.
# Wrong approach: block crawling
User-agent: *
Disallow: /old-page
# Right approach: allow crawling, prevent indexing
# Add to the page's HTML:
# <meta name="robots" content="noindex">
If you block a page with robots.txt AND add a noindex tag, the crawler cannot reach the page to see the noindex tag, which means the noindex will never take effect.
Manage Crawl Budget for Large Sites
For sites with fewer than 10,000 pages, crawl budget rarely matters. Google will crawl everything eventually.
For large sites -- e-commerce stores with millions of product pages, news sites with decades of archives, or large-scale platforms -- crawl budget is critical. Crawlers have a finite number of pages they will crawl per visit.
Block low-value pages to direct crawlers toward pages that matter:
User-agent: *
Disallow: /search?
Disallow: /*?sort=*
Disallow: /*?filter=*
Disallow: /*?page=*
Disallow: /tag/*
Disallow: /author/*/page/*
Disallow: /category/*/page/*
Sitemap: https://example.com/sitemap.xml
This prevents crawlers from wasting time on paginated archives, filtered views, and internal search results -- pages that offer minimal unique content.
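In these rules, * matches any run of characters and $ anchors a pattern to the end of the URL. A rough sketch of that matching logic in Python (rule_matches is a hypothetical helper; real crawlers also apply longest-match precedence between Allow and Disallow):

```python
import re

def rule_matches(pattern, url_path):
    """Check a robots.txt path pattern (with * and $) against a URL path."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"      # * matches any sequence of characters
        elif ch == "$":
            regex += "$"       # $ anchors the match at the end of the URL
        else:
            regex += re.escape(ch)
    return re.match(regex, url_path) is not None  # rules match from the start

print(rule_matches("/*?sort=*", "/products?sort=price"))  # True: blocked
print(rule_matches("/*?sort=*", "/products"))             # False: crawlable
print(rule_matches("/tag/*", "/tag/news"))                # True: blocked
```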
Review After Site Migrations
Site migrations are the most dangerous time for your robots.txt. URL structures change, new directories appear, old directories disappear, and whoever handles the migration may not update robots.txt to match.
After every migration, verify:
- Existing Disallow rules still target the correct paths
- New important sections are not accidentally blocked
- Old rules blocking deprecated paths are cleaned up
- The Sitemap directive points to the correct URL
A robots.txt that was perfect for your old site structure can be actively harmful for your new one.
Consider AI Crawlers
AI training crawlers are a relatively new concern. If you do not want your content used for AI model training, you should explicitly block the known AI crawler user agents. If you do nothing, most AI crawlers will treat your content as fair game.
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
Decide your position on AI crawling and implement it deliberately. New AI crawlers appear regularly, so revisit this list periodically.
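Since the list changes over time, it can help to generate these stanzas from a single list you maintain (a small sketch; the crawler names are the ones from the example above):

```python
# User agents to block from AI training -- revisit this list periodically
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

def ai_block_stanzas(agents):
    """Render a User-agent / Disallow stanza for each crawler in the list."""
    return "\n\n".join(f"User-agent: {a}\nDisallow: /" for a in agents)

print(ai_block_stanzas(AI_CRAWLERS))
```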
The Do / Don't Summary
| Do | Don't |
|---|---|
| Include a Sitemap directive | Use robots.txt to prevent indexing |
| Be specific with Disallow paths | Block CSS, JS, or image files |
| Test before deploying | Deploy without validating syntax |
| Use noindex for hiding from search | Assume Disallow = noindex |
| Review after site migrations | Set and forget your robots.txt |
| Block low-value pages on large sites | Micro-manage crawling per-URL |
| Keep it simple and readable | Let it grow to 200+ lines |
Good robots.txt practices protect your SEO. Bad ones destroy it.