robots.txt SEO Audit Checklist
A checklist for auditing your robots.txt file. Check for SEO issues, crawl problems, and misconfigurations in under 10 minutes.
A misconfigured robots.txt file can silently kill your organic traffic. This checklist walks through every check you need to audit your robots.txt in under 10 minutes. Run through it after every site migration, CMS update, or deployment pipeline change.
The 10-Point Audit
File exists and is accessible
Navigate to https://yourdomain.com/robots.txt in your browser. You should see a plain text response with a 200 status code.
Common failures:
- 404 response -- the file does not exist or is not in the domain root
- 301/302 redirect -- the file redirects elsewhere, which some crawlers may not follow
- 500 error -- server-side issue; Google will limit crawling until this resolves
- HTML response -- your server is returning a web page instead of a plain text file; check your Content-Type header
# Verify with curl:
# curl -I https://yourdomain.com/robots.txt
# Look for:
# HTTP/2 200
# content-type: text/plain
No accidental Disallow: / blocking everything
Search your file for Disallow: / and check which User-agent it falls under. If it appears under User-agent: *, your entire site is blocked from all crawlers.
# DANGER: This blocks your entire site
User-agent: *
Disallow: /
# SAFE: This blocks only a specific bot
User-agent: GPTBot
Disallow: /
This is the single most damaging robots.txt mistake. It is also the most common one found in audits, especially on sites that were recently migrated from a staging environment.
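If your robots.txt is generated or templated during your build, this check is easy to automate. A minimal Python sketch (the parsing is simplified: one directive per line, comments stripped) that flags a site-wide block under the wildcard agent:

```python
def has_sitewide_block(robots_txt: str) -> bool:
    """Return True if 'Disallow: /' appears in a 'User-agent: *' group."""
    current_agents = []
    seen_rule = False  # a rule line after agents means the next agent starts a new group
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                current_agents = []  # a new group begins
                seen_rule = False
            current_agents.append(value)
        elif field in ("disallow", "allow"):
            seen_rule = True
            if field == "disallow" and value == "/" and "*" in current_agents:
                return True
    return False

print(has_sitewide_block("User-agent: *\nDisallow: /"))        # True
print(has_sitewide_block("User-agent: GPTBot\nDisallow: /"))   # False
```

Run this in CI against the file your build is about to deploy, and fail the pipeline when it returns True.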
Important pages are not blocked
Test your key pages against your robots.txt rules. At minimum, check:
- Your homepage (/)
- Your main navigation pages
- High-traffic landing pages
- Blog posts and content pages
- Product or category pages (for e-commerce)
A Disallow rule might be broader than intended. Disallow: /product blocks /products/, /product-guide/, and /product-reviews/ -- not just /product.
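You can sanity-check this prefix behavior with Python's standard-library parser. Note that urllib.robotparser uses simple prefix matching and does not replicate every crawler's precedence rules, so treat it as a first pass rather than the final word:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /product
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The /product prefix blocks all of these, not just /product itself:
for path in ["/product", "/products/", "/product-guide/", "/product-reviews/"]:
    print(path, "allowed" if parser.can_fetch("*", path) else "blocked")
```

Feed the same parser your real file and your list of key URLs to cover the whole check in a few lines.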
CSS, JavaScript, and images are not blocked
Googlebot needs access to CSS, JavaScript, and image files to render your pages. If these resources are blocked, Google cannot understand your page layout, and pages may be indexed incorrectly or not at all.
Check for rules like:
# These should NOT be in your robots.txt
Disallow: /css/
Disallow: /js/
Disallow: /static/
Disallow: /assets/
Disallow: /images/
Disallow: /_next/
Disallow: /wp-content/themes/
If you must block some assets, be specific -- block individual files rather than entire directories.
Test your blocked resources
Paste your robots.txt and check if CSS, JavaScript, or image files are accidentally blocked from search engine crawlers.
Sitemap is referenced
Your robots.txt should include at least one Sitemap directive pointing to your XML sitemap. The URL must be absolute.
# Correct
Sitemap: https://example.com/sitemap.xml
# Wrong - relative URL
Sitemap: /sitemap.xml
# Wrong - HTTP instead of HTTPS
Sitemap: http://example.com/sitemap.xml
Verify that the sitemap URL actually returns a valid XML sitemap (200 response with XML content). A Sitemap directive pointing to a 404 is worse than no directive at all -- it wastes the crawler's time and signals poor site maintenance.
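The directive itself can be linted offline. A minimal sketch that flags missing, relative, and HTTP sitemap URLs (actually fetching the URL to confirm a 200 XML response is a separate, online step):

```python
def lint_sitemap_directives(robots_txt: str) -> list[str]:
    """Return a list of problems with the file's Sitemap directives."""
    problems = []
    sitemaps = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    if not sitemaps:
        problems.append("no Sitemap directive found")
    for url in sitemaps:
        if url.startswith("http://"):
            problems.append(f"{url}: uses HTTP instead of HTTPS")
        elif not url.startswith("https://"):
            problems.append(f"{url}: not an absolute URL")
    return problems

print(lint_sitemap_directives("User-agent: *\nSitemap: /sitemap.xml"))
```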
Syntax is valid
Check for common syntax errors:
- Misspelled directives: Useragent instead of User-agent, Dissallow instead of Disallow
- Missing colons: Disallow /admin/ instead of Disallow: /admin/
- Missing leading slash: Disallow: admin/ instead of Disallow: /admin/
- Rules without a User-agent: every Disallow and Allow must be under a User-agent declaration
- Invalid characters: non-ASCII characters, BOM markers, or Windows line-ending issues
# Common syntax errors
Useragent: * # Wrong: should be "User-agent"
Disallow /admin/ # Wrong: missing colon
Disallow: admin/ # Wrong: missing leading slash
disalow: /private/ # Wrong: misspelled
A syntax error does not just break the malformed line -- it can cause crawlers to misinterpret the entire rule group.
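These errors are mechanical enough to catch with a small linter. A sketch that flags unknown directives, missing colons, and rules outside a User-agent group (the directive list here covers the common set, not every extension):

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_syntax(robots_txt: str) -> list[str]:
    """Return human-readable syntax errors, one per offense."""
    errors = []
    in_group = False  # True once we have seen a User-agent line
    for n, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if ":" not in line:
            errors.append(f"line {n}: missing colon: {line!r}")
            continue
        field, value = (p.strip() for p in line.split(":", 1))
        key = field.lower()
        if key not in KNOWN_DIRECTIVES:
            errors.append(f"line {n}: unknown directive {field!r}")
        elif key == "user-agent":
            in_group = True
        elif key in ("disallow", "allow"):
            if not in_group:
                errors.append(f"line {n}: rule outside a User-agent group")
            if value and not value.startswith(("/", "*")):
                errors.append(f"line {n}: path should start with '/': {value!r}")
    return errors

for error in lint_syntax("Useragent: *\nDisallow /admin/\nDisallow: admin/"):
    print(error)
```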
No conflicting rules
Look for rules that contradict each other in confusing ways. While the spec defines precedence (longest matching path wins), conflicting rules signal that your robots.txt has grown organically without clear intent.
# Confusing: allow and block the same path
User-agent: *
Disallow: /blog/
Allow: /blog/
# Clear: specific exception within a block
User-agent: *
Disallow: /blog/drafts/
Allow: /blog/
Review each User-agent group independently. Make sure every rule serves a clear purpose and does not conflict with another rule in the same group.
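Exact-duplicate conflicts like the first example above can be detected mechanically. A sketch that flags paths appearing under both Allow and Disallow in the same group (simplified grouping: each User-agent line is treated as starting a fresh group):

```python
def find_conflicts(robots_txt: str) -> list[str]:
    """Paths that are both allowed and disallowed within one rule group."""
    conflicts = []
    allowed, disallowed = set(), set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        key, path = field.strip().lower(), value.strip()
        if key == "user-agent":
            allowed, disallowed = set(), set()  # new group (simplified model)
        elif key == "allow":
            allowed.add(path)
        elif key == "disallow":
            disallowed.add(path)
        conflicts.extend(p for p in allowed & disallowed if p not in conflicts)
    return conflicts

print(find_conflicts("User-agent: *\nDisallow: /blog/\nAllow: /blog/"))
```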
AI crawler rules are intentional
Check if your robots.txt addresses AI training crawlers and whether the rules match your organization's position on AI content usage.
If you have no AI crawler rules, that is a decision -- it means you are allowing AI crawlers to access your content. Make sure that is intentional.
If you do have AI crawler rules, verify you are covering the major agents:
# Common AI crawlers to consider
User-agent: GPTBot # OpenAI
User-agent: ChatGPT-User # OpenAI (ChatGPT browsing)
User-agent: Google-Extended # Google AI training
User-agent: ClaudeBot # Anthropic
User-agent: CCBot # Common Crawl
User-agent: Bytespider # ByteDance
User-agent: PerplexityBot # Perplexity AI
User-agent: anthropic-ai # Anthropic
New AI crawlers appear regularly. Review this list quarterly.
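Coverage of that list can be checked automatically. A sketch that reports which of the known agents your file mentions (the agent list mirrors the one above and needs the same quarterly refresh):

```python
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "ClaudeBot",
    "CCBot", "Bytespider", "PerplexityBot", "anthropic-ai",
]

def ai_crawler_coverage(robots_txt: str) -> dict[str, bool]:
    """Map each known AI crawler to whether the file names it."""
    mentioned = {
        value.strip().lower()
        for field, _, value in (
            line.partition(":") for line in robots_txt.splitlines()
        )
        if field.strip().lower() == "user-agent"
    }
    return {bot: bot.lower() in mentioned for bot in AI_CRAWLERS}

coverage = ai_crawler_coverage("User-agent: GPTBot\nDisallow: /")
print([bot for bot, covered in coverage.items() if not covered])
```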
File size is under 500KB
Google stops processing robots.txt files after 500 KiB (512,000 bytes). Any rules beyond that point are ignored, and the affected URLs are treated as allowed.
Most sites will never hit this limit. But if your robots.txt is dynamically generated or includes per-URL rules, check the file size. If it is approaching the limit, refactor your rules to use wildcard patterns instead of listing individual URLs.
# Bad: listing individual URLs (file size grows fast)
Disallow: /page-1
Disallow: /page-2
Disallow: /page-3
# ... hundreds more lines
# Good: use a pattern
Disallow: /page-*
# or restructure your site to put these under one directory
Disallow: /legacy-pages/
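The size check itself is a few lines; the limit below follows Google's documented cutoff, and the 80% warning threshold is an illustrative choice to give yourself headroom:

```python
LIMIT = 500 * 1024  # Google's documented processing limit (~500 KiB)

def check_size(robots_txt: str) -> str:
    """Classify a robots.txt body by byte size against Google's limit."""
    size = len(robots_txt.encode("utf-8"))
    if size > LIMIT:
        return f"over limit: {size} bytes -- rules past the limit are ignored"
    if size > LIMIT * 0.8:
        return f"warning: {size} bytes is within 20% of the limit"
    return f"ok: {size} bytes"

print(check_size("User-agent: *\nDisallow: /admin/\n"))
```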
File matches current site structure
The most overlooked check. Your robots.txt was written for a specific site structure. If your site has changed -- new URL patterns, removed directories, restructured content -- your robots.txt may be blocking pages that no longer exist (harmless but messy) or allowing pages that should be blocked (harmful).
After any significant site change, compare your robots.txt rules against your current URL structure:
- Are blocked directories still relevant?
- Have new directories been added that should be blocked?
- Do wildcard patterns still match the right things?
- Is the sitemap URL still correct?
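One way to spot drift, assuming you can export your live URL paths (from your sitemap or server logs), is to flag Disallow prefixes that no longer match anything. A sketch using plain prefix matching (a trailing wildcard is stripped; mid-pattern wildcards are not handled):

```python
def stale_disallows(robots_txt: str, current_paths: list[str]) -> list[str]:
    """Disallow prefixes that match none of the site's current paths."""
    prefixes = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        path = value.strip()
        if field.strip().lower() == "disallow" and path:
            prefixes.append(path)
    return [
        prefix for prefix in prefixes
        if not any(p.startswith(prefix.rstrip("*")) for p in current_paths)
    ]

robots = "User-agent: *\nDisallow: /old-shop/\nDisallow: /admin/"
paths = ["/admin/login", "/blog/post-1", "/products/widget"]
print(stale_disallows(robots, paths))  # /old-shop/ matches nothing anymore
```

A stale rule is usually harmless, but it is a strong hint the file has not kept pace with the site.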
Common Issues Found in Audits
Beyond the checklist items above, these issues come up frequently during robots.txt audits:
Run a full robots.txt audit
Paste your robots.txt and get instant feedback on syntax errors, blocking issues, and best-practice violations.
Leftover staging rules
The most dangerous find. A Disallow: / from a staging environment made it to production. Sometimes it is under a specific user agent (bad) or under the wildcard (catastrophic).
Overly broad wildcard patterns
Rules like Disallow: /*? block every URL with a query parameter, including legitimate paginated content, filtered views that should be indexed, and URLs with tracking parameters that would otherwise consolidate to their canonical versions.
Blocking the Sitemap URL
Rare but devastating. If your sitemap lives under a directory that is blocked by a Disallow rule, crawlers cannot access it. For example, Disallow: /seo/ would block https://example.com/seo/sitemap.xml.
Duplicate rules across groups
Multiple User-agent groups with identical rules add file size and complexity with no benefit. Consolidate them or use a shared wildcard group.
Missing trailing slash ambiguity
Disallow: /blog blocks /blog, /blog/, /blog/post-1, and also /blog-archive/. If you only meant to block the /blog/ directory, add the trailing slash: Disallow: /blog/.
After the Audit
Once you have completed the checklist:
- Fix any issues found, starting with the most critical (accidental site-wide blocks, resource blocking).
- Test the updated file against your key URLs using a robots.txt testing tool.
- Deploy the fixed file.
- Set up monitoring so you catch future regressions automatically.
- Schedule your next audit -- quarterly is a good cadence, or after any major site change.
Automate what you can
The file-accessibility, site-wide block, sitemap, syntax, and file-size checks from this checklist can be automated as part of your CI/CD pipeline. A robots.txt linter that runs on every deployment catches the most common issues before they reach production.
Ten checks. Ten minutes. Catch the robots.txt issues before they catch you.
Test your robots.txt for free
Validate your robots.txt file instantly. Check directives, find crawling issues, and ensure search engines can access your site.