How to Check Any Website's robots.txt File
How to find and check any website's robots.txt file. View directives, verify accessibility, and check for common configuration mistakes.
Every website's robots.txt file is public. You can check any site's file in seconds, whether you are auditing your own site, investigating a competitor, or debugging why a page is not showing up in search results.
This guide covers how to find and inspect robots.txt files, what to look for, and how to spot common mistakes.
The URL Convention
The robots.txt file always lives at the root of a domain. The convention is simple and universal:
https://example.com/robots.txt
This works for any website. Just replace example.com with the domain you want to check. A few important rules:
- Each subdomain has its own file. https://blog.example.com/robots.txt is separate from https://example.com/robots.txt.
- Protocol matters. If a site serves both HTTP and HTTPS, the file served over https:// is the one that counts for HTTPS URLs.
- The path is case-sensitive on most servers. Always use lowercase /robots.txt.
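Since the convention is the same everywhere, it can be applied programmatically: given any page URL, the robots.txt that governs it lives at the root of the same scheme and host. A minimal sketch (the function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin serving page_url.

    The file lives at the root of the scheme + host pair, so the
    path, query, and fragment of the page URL are dropped.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/1?ref=x"))
# https://blog.example.com/robots.txt
```

Because the subdomain is kept as-is, this also reflects the rule above: a blog.example.com page maps to the blog subdomain's own file, not the apex domain's.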
Checking in Your Browser
The fastest way to check any site's robots.txt is to type the URL directly into your browser's address bar.
Navigate to the file
Open your browser and go to https://example.com/robots.txt. Replace example.com with the target domain.
Check the response
You should see plain text in your browser. If you see the site's regular web page, the server may be redirecting the request -- that is a misconfiguration.
Inspect the content
Read through the directives. Look for User-agent blocks, Disallow rules, Allow overrides, and Sitemap directives.
If you get a 404 error, the site does not have a robots.txt file. This is not an error in itself -- it just means there are no crawl restrictions. Crawlers will treat the entire site as accessible.
If you get a 403 error, the server is actively blocking access to the robots.txt file. This is problematic. When crawlers cannot access robots.txt, different bots handle it differently. Google will assume full access is allowed. Other crawlers may be more conservative.
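The status-code outcomes above can be summarized as a small helper (a sketch; the function name and message strings are mine, and the 403 interpretation reflects Google's behavior as described above, since other crawlers vary):

```python
def interpret_status(code: int) -> str:
    """Map a robots.txt HTTP status code to the interpretation
    described above (403 handling per Google's documented behavior)."""
    if code == 200:
        return "file present: apply its rules"
    if code == 404:
        return "no file: no crawl restrictions"
    if code == 403:
        return "access blocked: Google assumes full access, other bots vary"
    return "unexpected status: investigate"

print(interpret_status(404))  # no file: no crawl restrictions
```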
What to Look For
When you open a robots.txt file, check these things in order.
Valid Directive Syntax
Every non-comment line should follow the format Directive: value. Look for:
- Lines that do not start with a recognized directive (User-agent, Disallow, Allow, Sitemap, Crawl-delay)
- Missing colons between the directive and value
- Directives with typos (e.g., Disalow, or User-Agent: with inconsistent casing -- though most crawlers are forgiving on case)
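A minimal line-by-line checker along these lines (illustrative only: it knows just the five directives listed above and does simple string checks, not full parsing):

```python
# Directives recognized by this sketch, matched case-insensitively
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def find_syntax_issues(text: str) -> list[str]:
    """Flag lines that are neither blank, comments, nor well-formed
    'Directive: value' pairs using a recognized directive name."""
    issues = []
    for n, line in enumerate(text.splitlines(), 1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        if ":" not in stripped:
            issues.append(f"line {n}: missing colon")
            continue
        directive = stripped.split(":", 1)[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            issues.append(f"line {n}: unknown directive '{directive}'")
    return issues

print(find_syntax_issues("User-agent: *\nDisalow: /tmp/\nno colon here"))
```

Lowercasing before the lookup is what makes the check forgiving on casing, matching the behavior of most crawlers noted above.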
The Wildcard Block
Most files start with a User-agent: * block. This sets the default rules for all crawlers. Check whether it is too restrictive or too permissive for the site's needs.
# Too restrictive -- blocks everything
User-agent: *
Disallow: /
# Too permissive -- blocks nothing
User-agent: *
Allow: /
Blocked Important Paths
Look for Disallow rules that might accidentally block important content:
- / (blocks the entire site)
- /blog/ on a content-heavy site
- /products/ on an e-commerce site
- /css/, /js/, /images/ (Google needs these to render pages)
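A sketch that scans a pasted file for the universally risky patterns above (the prefix list is illustrative, not exhaustive; whether blocking /blog/ or /products/ matters depends on the site, so only site-independent red flags are checked here):

```python
# Asset paths whose blocking usually hurts rendering (illustrative list)
RISKY_PREFIXES = ("/css", "/js", "/images", "/wp-content", "/assets")

def flag_risky_disallows(text: str) -> list[str]:
    """Return warnings for Disallow rules that block the whole site
    or common asset directories."""
    flags = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.lower().startswith("disallow:"):
            continue
        value = stripped.split(":", 1)[1].strip()
        if value == "/":
            flags.append("Disallow: / blocks the entire site")
        elif value.startswith(RISKY_PREFIXES):
            flags.append(f"Disallow: {value} may block assets Google needs to render pages")
    return flags

print(flag_risky_disallows("User-agent: *\nDisallow: /\nDisallow: /css/\n"))
```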
Sitemap References
Check whether the file includes a Sitemap directive. If it does, verify that:
- The URL is absolute (starts with https://)
- The URL actually resolves to a valid XML sitemap
- The domain in the Sitemap URL matches the domain serving the robots.txt
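The first and third checks can be done statically (a sketch; confirming that the URL actually resolves to a valid XML sitemap still requires fetching it):

```python
from urllib.parse import urlsplit

def check_sitemap_directive(value: str, robots_host: str) -> list[str]:
    """Static checks on a Sitemap directive value: the URL must be
    absolute, and its host should match the host serving robots.txt."""
    problems = []
    parts = urlsplit(value.strip())
    if parts.scheme not in ("http", "https"):
        problems.append("URL is not absolute (must start with http:// or https://)")
    elif parts.netloc != robots_host:
        problems.append("sitemap host differs from the host serving robots.txt")
    return problems

print(check_sitemap_directive("/sitemap.xml", "example.com"))
```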
Bot-Specific Rules
Look for User-agent blocks targeting specific crawlers. Common ones include Googlebot, Bingbot, GPTBot, and ClaudeBot. If a site has bot-specific rules, check whether they are more or less restrictive than the wildcard block.
Checking If Your Own Pages Are Blocked
If you are investigating why your own pages are not appearing in search results, you need to check whether your robots.txt is blocking them.
Manual check: Open your robots.txt and trace the rules for the URL in question. Identify which User-agent block applies (check for a Googlebot-specific block first, then fall back to *). Then check each Disallow and Allow rule in that block against your URL path.
Automated check: Use a robots.txt testing tool. Paste your file, enter the URL and user agent, and get an instant answer. This is faster and less error-prone, especially with complex files that use wildcards.
# Example: Is /blog/2024/my-post blocked?
User-agent: *
Disallow: /blog/2024/drafts/
Allow: /blog/
# Answer: /blog/2024/my-post is ALLOWED
# The Disallow only matches paths under /blog/2024/drafts/
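The same trace can be automated with Python's standard-library urllib.robotparser. One caveat: it implements the original robots.txt spec (first matching rule in file order, no wildcard support), so its verdicts can differ from Google's longest-match behavior on complex files; for the simple example above the two agree:

```python
import urllib.robotparser

# The example file from above
rules = """\
User-agent: *
Disallow: /blog/2024/drafts/
Allow: /blog/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/blog/2024/my-post"))   # True
print(rp.can_fetch("*", "https://example.com/blog/2024/drafts/x"))  # False
```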
Check the actual URL, not just the path
If your site uses query parameters (like ?page=2 or ?ref=twitter), test the full URL including the query string. Rules like Disallow: /*? will block any URL with parameters.
Missing File vs. Empty File
These are different scenarios and they mean different things:
No robots.txt (404 response): The site has no crawl restrictions. All crawlers can access all pages. This is the default state for any domain that has not explicitly created a robots.txt.
Empty robots.txt (200 response, no content): Same practical effect as a 404. No restrictions. But the empty file tells crawlers that the site owner intentionally chose not to restrict access.
robots.txt with only a Sitemap directive:
Sitemap: https://example.com/sitemap.xml
This provides the sitemap location without restricting any crawling. It is a common and valid configuration for sites that want to help crawlers find content without blocking anything.
Common Configuration Mistakes
When checking a robots.txt file -- whether yours or someone else's -- watch for these frequent errors:
Blocking CSS and JavaScript
Rules like Disallow: /wp-content/ or Disallow: /assets/ can stop Google from loading the CSS and JavaScript it needs to render pages. This degrades indexing quality.
Using relative Sitemap URLs
Sitemap: /sitemap.xml is invalid. The Sitemap directive requires an absolute URL with protocol and domain.
Conflicting rules without clear precedence
When Allow and Disallow rules overlap, the most specific (longest) path wins. If they are the same length, Allow wins. Many site owners do not realize this.
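The precedence rule can be made concrete with a small sketch (prefix matching only, no wildcards; the function and rule list are illustrative):

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Google-style precedence: the most specific (longest) matching
    rule wins; on a length tie, Allow wins. If nothing matches,
    crawling is allowed."""
    best_len, best_kind = -1, "allow"
    for kind, rule_path in rules:
        if path.startswith(rule_path):
            if len(rule_path) > best_len or (
                len(rule_path) == best_len and kind == "allow"
            ):
                best_len, best_kind = len(rule_path), kind
    return best_kind == "allow"

rules = [("disallow", "/shop/"), ("allow", "/shop/sale/")]
print(is_allowed("/shop/sale/item", rules))  # True: Allow is longer
print(is_allowed("/shop/cart", rules))       # False: only Disallow matches
```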
Multiple User-agent lines before a single rule set
Some files group multiple user agents before a set of rules. This is valid syntax but can be confusing to read.
Forgetting about subdomains
Rules in example.com/robots.txt do not apply to blog.example.com. Each subdomain needs its own file.
Automating robots.txt Checks
If you manage multiple sites or want to monitor your robots.txt over time, you can automate checks.
With curl and a cron job:
curl -s -o /dev/null -w "%{http_code}" https://yourdomain.com/robots.txt
This returns just the HTTP status code. Set up an alert if it returns anything other than 200.
With a monitoring script:
# Fetch robots.txt and compare it against the last known-good copy
curl -s https://yourdomain.com/robots.txt > /tmp/robots-current.txt
diff /tmp/robots-previous.txt /tmp/robots-current.txt
# After reviewing any differences, promote the current copy to the new baseline
mv /tmp/robots-current.txt /tmp/robots-previous.txt
Compare the current file against a known-good version. If anything has changed, investigate before it affects your search traffic.
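When monitoring many sites, storing a fingerprint instead of a full copy of each file keeps the comparison cheap (a sketch; fetching the live file is left to curl or an HTTP library):

```python
import hashlib

def fingerprint(body: str) -> str:
    """Stable fingerprint of a robots.txt body. Store the value for a
    known-good file and alert when a fresh fetch hashes differently."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

baseline = fingerprint("User-agent: *\nDisallow: /drafts/\n")
current = fingerprint("User-agent: *\nDisallow: /\n")
print(current != baseline)  # True: the file changed, so investigate
```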
With a testing tool: Some robots.txt validators offer monitoring features that periodically check your file and alert you to changes or issues. This is the lowest-maintenance option for ongoing monitoring.