robots.txt SEO Audit Checklist
A checklist for auditing your robots.txt file. Check for SEO issues, crawl problems, and misconfigurations in under 10 minutes.
A misconfigured robots.txt file can silently kill your organic traffic. This checklist walks through every check you need to audit your robots.txt in under 10 minutes. Run through it after every site migration, CMS update, or deployment pipeline change.
The 10-Point Audit
File exists and is accessible
Navigate to https://yourdomain.com/robots.txt in your browser. You should see a plain text response with a 200 status code.
Common failures:
- 404 response -- the file does not exist or is not in the domain root
- 301/302 redirect -- the file redirects elsewhere, which some crawlers may not follow
- 500 error -- server-side issue; Google will limit crawling until this resolves
- HTML response -- your server is returning a web page instead of a plain text file; check your Content-Type header
# Verify with curl:
# curl -I https://yourdomain.com/robots.txt
# Look for:
# HTTP/2 200
# content-type: text/plain
No accidental Disallow: / blocking everything
Search your file for Disallow: / and check which User-agent it falls under. If it appears under User-agent: *, your entire site is blocked from all crawlers.
# DANGER: This blocks your entire site
User-agent: *
Disallow: /
# SAFE: This blocks only a specific bot
User-agent: GPTBot
Disallow: /
This is the single most damaging robots.txt mistake. It is also the most common one found in audits, especially on sites that were recently migrated from a staging environment.
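If your robots.txt is generated or templated during your build, this check is easy to automate. A minimal Python sketch (the parsing is simplified: one directive per line, comments stripped) that flags a site-wide block under the wildcard agent:

```python
def has_sitewide_block(robots_txt: str) -> bool:
    """Return True if 'Disallow: /' appears in a 'User-agent: *' group."""
    current_agents = []
    seen_rule = False  # a rule line after agents means the next agent starts a new group
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                current_agents = []  # a new group begins
                seen_rule = False
            current_agents.append(value)
        elif field in ("disallow", "allow"):
            seen_rule = True
            if field == "disallow" and value == "/" and "*" in current_agents:
                return True
    return False

print(has_sitewide_block("User-agent: *\nDisallow: /"))        # True
print(has_sitewide_block("User-agent: GPTBot\nDisallow: /"))   # False
```

Run this in CI against the file your build is about to deploy, and fail the pipeline when it returns True.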
Important pages are not blocked
Test your key pages against your robots.txt rules. At minimum, check:
- Your homepage (/)
- Your main navigation pages
- High-traffic landing pages
- Blog posts and content pages
- Product or category pages (for e-commerce)
A Disallow rule might be broader than intended. Disallow: /product blocks /products/, /product-guide/, and /product-reviews/ -- not just /product.
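You can sanity-check this prefix behavior with Python's standard-library parser. Note that urllib.robotparser uses simple prefix matching and does not replicate every crawler's precedence rules, so treat it as a first pass rather than the final word:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /product
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The /product prefix blocks all of these, not just /product itself:
for path in ["/product", "/products/", "/product-guide/", "/product-reviews/"]:
    print(path, "allowed" if parser.can_fetch("*", path) else "blocked")
```

Feed the same parser your real file and your list of key URLs to cover the whole check in a few lines.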
CSS, JavaScript, and images are not blocked
Googlebot needs access to CSS, JavaScript, and image files to render your pages. If these resources are blocked, Google cannot understand your page layout, and pages may be indexed incorrectly or not at all.
Check for rules like:
# These should NOT be in your robots.txt
Disallow: /css/
Disallow: /js/
Disallow: /static/
Disallow: /assets/
Disallow: /images/
Disallow: /_next/
Disallow: /wp-content/themes/
If you must block some assets, be specific -- block individual files rather than entire directories.
Test your blocked resources
Paste your robots.txt and check if CSS, JavaScript, or image files are accidentally blocked from search engine crawlers.
Sitemap is referenced
Your robots.txt should include at least one Sitemap directive pointing to your XML sitemap. The URL must be absolute.
# Correct
Sitemap: https://example.com/sitemap.xml
# Wrong - relative URL
Sitemap: /sitemap.xml
# Wrong - HTTP instead of HTTPS
Sitemap: http://example.com/sitemap.xml
Verify that the sitemap URL actually returns a valid XML sitemap (200 response with XML content). A Sitemap directive pointing to a 404 is worse than no directive at all -- it wastes the crawler's time and signals poor site maintenance.
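The directive itself can be linted offline. A minimal sketch that flags missing, relative, and HTTP sitemap URLs (actually fetching the URL to confirm a 200 XML response is a separate, online step):

```python
def lint_sitemap_directives(robots_txt: str) -> list[str]:
    """Return a list of problems with the file's Sitemap directives."""
    problems = []
    sitemaps = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    if not sitemaps:
        problems.append("no Sitemap directive found")
    for url in sitemaps:
        if url.startswith("http://"):
            problems.append(f"{url}: uses HTTP instead of HTTPS")
        elif not url.startswith("https://"):
            problems.append(f"{url}: not an absolute URL")
    return problems

print(lint_sitemap_directives("User-agent: *\nSitemap: /sitemap.xml"))
```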
Syntax is valid
Check for common syntax errors:
- Misspelled directives: Useragent instead of User-agent, Dissallow instead of Disallow
- Missing colons: Disallow /admin/ instead of Disallow: /admin/
- Missing leading slash: Disallow: admin/ instead of Disallow: /admin/
- Rules without a User-agent: every Disallow and Allow must be under a User-agent declaration
- Invalid characters: non-ASCII characters, BOM markers, or Windows line-ending issues
# Common syntax errors
Useragent: * # Wrong: should be "User-agent"
Disallow /admin/ # Wrong: missing colon
Disallow: admin/ # Wrong: missing leading slash
disalow: /private/ # Wrong: misspelled
A syntax error does not just break the malformed line -- it can cause crawlers to misinterpret the entire rule group.
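These errors are mechanical enough to catch with a small linter. A sketch that flags unknown directives, missing colons, and rules outside a User-agent group (the directive list here covers the common set, not every extension):

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_syntax(robots_txt: str) -> list[str]:
    """Return human-readable syntax errors, one per offense."""
    errors = []
    in_group = False  # True once we have seen a User-agent line
    for n, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if ":" not in line:
            errors.append(f"line {n}: missing colon: {line!r}")
            continue
        field, value = (p.strip() for p in line.split(":", 1))
        key = field.lower()
        if key not in KNOWN_DIRECTIVES:
            errors.append(f"line {n}: unknown directive {field!r}")
        elif key == "user-agent":
            in_group = True
        elif key in ("disallow", "allow"):
            if not in_group:
                errors.append(f"line {n}: rule outside a User-agent group")
            if value and not value.startswith(("/", "*")):
                errors.append(f"line {n}: path should start with '/': {value!r}")
    return errors

for error in lint_syntax("Useragent: *\nDisallow /admin/\nDisallow: admin/"):
    print(error)
```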
No conflicting rules
Look for rules that contradict each other in confusing ways. While the spec defines precedence (longest matching path wins), conflicting rules signal that your robots.txt has grown organically without clear intent.
# Confusing: allow and block the same path
User-agent: *
Disallow: /blog/
Allow: /blog/
# Clear: specific exception within a block
User-agent: *
Disallow: /blog/drafts/
Allow: /blog/
Review each User-agent group independently. Make sure every rule serves a clear purpose and does not conflict with another rule in the same group.
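Exact-duplicate conflicts like the first example above can be detected mechanically. A sketch that flags paths appearing under both Allow and Disallow in the same group (simplified grouping: each User-agent line is treated as starting a fresh group):

```python
def find_conflicts(robots_txt: str) -> list[str]:
    """Paths that are both allowed and disallowed within one rule group."""
    conflicts = []
    allowed, disallowed = set(), set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        key, path = field.strip().lower(), value.strip()
        if key == "user-agent":
            allowed, disallowed = set(), set()  # new group (simplified model)
        elif key == "allow":
            allowed.add(path)
        elif key == "disallow":
            disallowed.add(path)
        conflicts.extend(p for p in allowed & disallowed if p not in conflicts)
    return conflicts

print(find_conflicts("User-agent: *\nDisallow: /blog/\nAllow: /blog/"))
```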
AI crawler rules are intentional
Check if your robots.txt addresses AI training crawlers and whether the rules match your organization's position on AI content usage.
If you have no AI crawler rules, that is a decision -- it means you are allowing AI crawlers to access your content. Make sure that is intentional.
If you do have AI crawler rules, verify you are covering the major agents:
# Common AI crawlers to consider
User-agent: GPTBot # OpenAI
User-agent: ChatGPT-User # OpenAI (ChatGPT browsing)
User-agent: Google-Extended # Google AI training
User-agent: ClaudeBot # Anthropic
User-agent: CCBot # Common Crawl
User-agent: Bytespider # ByteDance
User-agent: PerplexityBot # Perplexity AI
User-agent: anthropic-ai # Anthropic
New AI crawlers appear regularly. Review this list quarterly.
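Coverage of that list can be checked automatically. A sketch that reports which of the known agents your file mentions (the agent list mirrors the one above and needs the same quarterly refresh):

```python
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "ClaudeBot",
    "CCBot", "Bytespider", "PerplexityBot", "anthropic-ai",
]

def ai_crawler_coverage(robots_txt: str) -> dict[str, bool]:
    """Map each known AI crawler to whether the file names it."""
    mentioned = {
        value.strip().lower()
        for field, _, value in (
            line.partition(":") for line in robots_txt.splitlines()
        )
        if field.strip().lower() == "user-agent"
    }
    return {bot: bot.lower() in mentioned for bot in AI_CRAWLERS}

coverage = ai_crawler_coverage("User-agent: GPTBot\nDisallow: /")
print([bot for bot, covered in coverage.items() if not covered])
```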
File size is under 500KB
Google stops processing robots.txt files after 500 KiB (512,000 bytes). Any rules beyond that point are ignored, and the affected URLs are treated as allowed.
Most sites will never hit this limit. But if your robots.txt is dynamically generated or includes per-URL rules, check the file size. If it is approaching the limit, refactor your rules to use wildcard patterns instead of listing individual URLs.
# Bad: listing individual URLs (file size grows fast)
Disallow: /page-1
Disallow: /page-2
Disallow: /page-3
# ... hundreds more lines
# Good: use a pattern
Disallow: /page-*
# or restructure your site to put these under one directory
Disallow: /legacy-pages/
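The size check itself is a few lines; the limit below follows Google's documented cutoff, and the 80% warning threshold is an illustrative choice to give yourself headroom:

```python
LIMIT = 500 * 1024  # Google's documented processing limit (~500 KiB)

def check_size(robots_txt: str) -> str:
    """Classify a robots.txt body by byte size against Google's limit."""
    size = len(robots_txt.encode("utf-8"))
    if size > LIMIT:
        return f"over limit: {size} bytes -- rules past the limit are ignored"
    if size > LIMIT * 0.8:
        return f"warning: {size} bytes is within 20% of the limit"
    return f"ok: {size} bytes"

print(check_size("User-agent: *\nDisallow: /admin/\n"))
```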
File matches current site structure
The most overlooked check. Your robots.txt was written for a specific site structure. If your site has changed -- new URL patterns, removed directories, restructured content -- your robots.txt may be blocking pages that no longer exist (harmless but messy) or allowing pages that should be blocked (harmful).
After any significant site change, compare your robots.txt rules against your current URL structure:
- Are blocked directories still relevant?
- Have new directories been added that should be blocked?
- Do wildcard patterns still match the right things?
- Is the sitemap URL still correct?
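One way to spot drift, assuming you can export your live URL paths (from your sitemap or server logs), is to flag Disallow prefixes that no longer match anything. A sketch using plain prefix matching (a trailing wildcard is stripped; mid-pattern wildcards are not handled):

```python
def stale_disallows(robots_txt: str, current_paths: list[str]) -> list[str]:
    """Disallow prefixes that match none of the site's current paths."""
    prefixes = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        path = value.strip()
        if field.strip().lower() == "disallow" and path:
            prefixes.append(path)
    return [
        prefix for prefix in prefixes
        if not any(p.startswith(prefix.rstrip("*")) for p in current_paths)
    ]

robots = "User-agent: *\nDisallow: /old-shop/\nDisallow: /admin/"
paths = ["/admin/login", "/blog/post-1", "/products/widget"]
print(stale_disallows(robots, paths))  # /old-shop/ matches nothing anymore
```

A stale rule is usually harmless, but it is a strong hint the file has not kept pace with the site.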
Common Issues Found in Audits
Beyond the checklist items above, these issues come up frequently during robots.txt audits:
Run a full robots.txt audit
Paste your robots.txt and get instant feedback on syntax errors, blocking issues, and best-practice violations.
Leftover staging rules
The most dangerous find. A Disallow: / from a staging environment made it to production. Sometimes it is under a specific user agent (bad) or under the wildcard (catastrophic).
Overly broad wildcard patterns
Rules like Disallow: /*? block every URL with a query parameter, including legitimate paginated content, filtered views that should be indexed, and URLs with tracking parameters that would otherwise consolidate to their canonical versions.
Blocking the Sitemap URL
Rare but devastating. If your sitemap lives under a directory that is blocked by a Disallow rule, crawlers cannot access it. For example, Disallow: /seo/ would block https://example.com/seo/sitemap.xml.
Duplicate rules across groups
Multiple User-agent groups with identical rules add file size and complexity with no benefit. Consolidate them or use a shared wildcard group.
Missing trailing slash ambiguity
Disallow: /blog blocks /blog, /blog/, /blog/post-1, and also /blog-archive/. If you only meant to block the /blog/ directory, add the trailing slash: Disallow: /blog/.
After the Audit
Once you have completed the checklist:
- Fix any issues found, starting with the most critical (accidental site-wide blocks, resource blocking).
- Test the updated file against your key URLs using a robots.txt testing tool.
- Deploy the fixed file.
- Set up monitoring so you catch future regressions automatically.
- Schedule your next audit -- quarterly is a good cadence, or after any major site change.
Automate what you can
The file-accessibility, site-wide block, sitemap, syntax, and file-size checks from this checklist can be automated as part of your CI/CD pipeline. A robots.txt linter that runs on every deployment catches the most common issues before they reach production.
Ten checks. Ten minutes. Catch the robots.txt issues before they catch you.
Test your robots.txt for free
Validate your robots.txt file instantly. Check directives, find crawling issues, and ensure search engines can access your site.