How to Check Any Website's robots.txt File
How to find and check any website's robots.txt file. View directives, verify accessibility, and check for common configuration mistakes.
Every website's robots.txt file is public. You can check any site's file in seconds, whether you are auditing your own site, investigating a competitor, or debugging why a page is not showing up in search results.
This guide covers how to find and inspect robots.txt files, what to look for, and how to spot common mistakes.
The URL Convention
The robots.txt file always lives at the root of a domain. The convention is simple and universal:
https://example.com/robots.txt
This works for any website. Just replace example.com with the domain you want to check. A few important rules:
- Each subdomain has its own file. https://blog.example.com/robots.txt is separate from https://example.com/robots.txt.
- Protocol matters. If a site serves both HTTP and HTTPS, the file served over https:// is the one that counts for HTTPS URLs.
- The path is case-sensitive on most servers. Always use lowercase /robots.txt.
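Since the convention is the same everywhere, it can be applied programmatically: given any page URL, the robots.txt that governs it lives at the root of the same scheme and host. A minimal sketch (the function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the origin serving page_url.

    The file lives at the root of the scheme + host pair, so the
    path, query, and fragment of the page URL are dropped.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/1?ref=x"))
# https://blog.example.com/robots.txt
```

Because the subdomain is kept as-is, this also reflects the rule above: a blog.example.com page maps to the blog subdomain's own file, not the apex domain's.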
Checking in Your Browser
The fastest way to check any site's robots.txt is to type the URL directly into your browser's address bar.
Navigate to the file
Open your browser and go to https://example.com/robots.txt. Replace example.com with the target domain.
Check the response
You should see plain text in your browser. If you see the site's regular web page, the server may be redirecting the request -- that is a misconfiguration.
Inspect the content
Read through the directives. Look for User-agent blocks, Disallow rules, Allow overrides, and Sitemap directives.
If you get a 404 error, the site does not have a robots.txt file. This is not an error in itself -- it just means there are no crawl restrictions. Crawlers will treat the entire site as accessible.
If you get a 403 error, the server is actively blocking access to the robots.txt file. This is problematic. When crawlers cannot access robots.txt, different bots handle it differently. Google will assume full access is allowed. Other crawlers may be more conservative.
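The status-code outcomes above can be summarized as a small helper (a sketch; the function name and message strings are mine, and the 403 interpretation reflects Google's behavior as described above, since other crawlers vary):

```python
def interpret_status(code: int) -> str:
    """Map a robots.txt HTTP status code to the interpretation
    described above (403 handling per Google's documented behavior)."""
    if code == 200:
        return "file present: apply its rules"
    if code == 404:
        return "no file: no crawl restrictions"
    if code == 403:
        return "access blocked: Google assumes full access, other bots vary"
    return "unexpected status: investigate"

print(interpret_status(404))  # no file: no crawl restrictions
```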
What to Look For
When you open a robots.txt file, check these things in order.
Valid Directive Syntax
Every non-comment line should follow the format Directive: value. Look for:
- Lines that do not start with a recognized directive (User-agent, Disallow, Allow, Sitemap, Crawl-delay)
- Missing colons between the directive and value
- Directives with typos (e.g., Disalow, or User-Agent: with inconsistent casing -- though most crawlers are forgiving on case)
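A minimal line-by-line checker along these lines (illustrative only: it knows just the five directives listed above and does simple string checks, not full parsing):

```python
# Directives recognized by this sketch, matched case-insensitively
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def find_syntax_issues(text: str) -> list[str]:
    """Flag lines that are neither blank, comments, nor well-formed
    'Directive: value' pairs using a recognized directive name."""
    issues = []
    for n, line in enumerate(text.splitlines(), 1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        if ":" not in stripped:
            issues.append(f"line {n}: missing colon")
            continue
        directive = stripped.split(":", 1)[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            issues.append(f"line {n}: unknown directive '{directive}'")
    return issues

print(find_syntax_issues("User-agent: *\nDisalow: /tmp/\nno colon here"))
```

Lowercasing before the lookup is what makes the check forgiving on casing, matching the behavior of most crawlers noted above.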
The Wildcard Block
Most files start with a User-agent: * block. This sets the default rules for all crawlers. Check whether it is too restrictive or too permissive for the site's needs.
# Too restrictive -- blocks everything
User-agent: *
Disallow: /
# Too permissive -- blocks nothing
User-agent: *
Allow: /
Blocked Important Paths
Look for Disallow rules that might accidentally block important content:
- / (blocks the entire site)
- /blog/ on a content-heavy site
- /products/ on an e-commerce site
- /css/, /js/, /images/ (Google needs these to render pages)
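A sketch that scans a pasted file for the universally risky patterns above (the prefix list is illustrative, not exhaustive; whether blocking /blog/ or /products/ matters depends on the site, so only site-independent red flags are checked here):

```python
# Asset paths whose blocking usually hurts rendering (illustrative list)
RISKY_PREFIXES = ("/css", "/js", "/images", "/wp-content", "/assets")

def flag_risky_disallows(text: str) -> list[str]:
    """Return warnings for Disallow rules that block the whole site
    or common asset directories."""
    flags = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.lower().startswith("disallow:"):
            continue
        value = stripped.split(":", 1)[1].strip()
        if value == "/":
            flags.append("Disallow: / blocks the entire site")
        elif value.startswith(RISKY_PREFIXES):
            flags.append(f"Disallow: {value} may block assets Google needs to render pages")
    return flags

print(flag_risky_disallows("User-agent: *\nDisallow: /\nDisallow: /css/\n"))
```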
Sitemap References
Check whether the file includes a Sitemap directive. If it does, verify that:
- The URL is absolute (starts with https://)
- The URL actually resolves to a valid XML sitemap
- The domain in the Sitemap URL matches the domain serving the robots.txt
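The first and third checks can be done statically (a sketch; confirming that the URL actually resolves to a valid XML sitemap still requires fetching it):

```python
from urllib.parse import urlsplit

def check_sitemap_directive(value: str, robots_host: str) -> list[str]:
    """Static checks on a Sitemap directive value: the URL must be
    absolute, and its host should match the host serving robots.txt."""
    problems = []
    parts = urlsplit(value.strip())
    if parts.scheme not in ("http", "https"):
        problems.append("URL is not absolute (must start with http:// or https://)")
    elif parts.netloc != robots_host:
        problems.append("sitemap host differs from the host serving robots.txt")
    return problems

print(check_sitemap_directive("/sitemap.xml", "example.com"))
```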
Bot-Specific Rules
Look for User-agent blocks targeting specific crawlers. Common ones include Googlebot, Bingbot, GPTBot, and ClaudeBot. If a site has bot-specific rules, check whether they are more or less restrictive than the wildcard block.
Checking If Your Own Pages Are Blocked
If you are investigating why your own pages are not appearing in search results, you need to check whether your robots.txt is blocking them.
Manual check: Open your robots.txt and trace the rules for the URL in question. Identify which User-agent block applies (check for a Googlebot-specific block first, then fall back to *). Then check each Disallow and Allow rule in that block against your URL path.
Automated check: Use a robots.txt testing tool. Paste your file, enter the URL and user agent, and get an instant answer. This is faster and less error-prone, especially with complex files that use wildcards.
# Example: Is /blog/2024/my-post blocked?
User-agent: *
Disallow: /blog/2024/drafts/
Allow: /blog/
# Answer: /blog/2024/my-post is ALLOWED
# The Disallow only matches paths under /blog/2024/drafts/
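The same trace can be automated with Python's standard-library urllib.robotparser. One caveat: it implements the original robots.txt spec (first matching rule in file order, no wildcard support), so its verdicts can differ from Google's longest-match behavior on complex files; for the simple example above the two agree:

```python
import urllib.robotparser

# The example file from above
rules = """\
User-agent: *
Disallow: /blog/2024/drafts/
Allow: /blog/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/blog/2024/my-post"))   # True
print(rp.can_fetch("*", "https://example.com/blog/2024/drafts/x"))  # False
```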
Check the actual URL, not just the path
If your site uses query parameters (like ?page=2 or ?ref=twitter), test the full URL including the query string. Rules like Disallow: /*? will block any URL with parameters.
Missing File vs. Empty File
These are different scenarios and they mean different things:
No robots.txt (404 response): The site has no crawl restrictions. All crawlers can access all pages. This is the default state for any domain that has not explicitly created a robots.txt.
Empty robots.txt (200 response, no content): Same practical effect as a 404. No restrictions. But the empty file tells crawlers that the site owner intentionally chose not to restrict access.
robots.txt with only a Sitemap directive:
Sitemap: https://example.com/sitemap.xml
This provides the sitemap location without restricting any crawling. It is a common and valid configuration for sites that want to help crawlers find content without blocking anything.
Common Configuration Mistakes
When checking a robots.txt file -- whether yours or someone else's -- watch for these frequent errors:
Blocking CSS and JavaScript
Rules like Disallow: /wp-content/ or Disallow: /assets/ can stop Google from loading the CSS and JavaScript it needs to render pages. This degrades indexing quality.
Using relative Sitemap URLs
Sitemap: /sitemap.xml is invalid. The Sitemap directive requires an absolute URL with protocol and domain.
Conflicting rules without clear precedence
When Allow and Disallow rules overlap, the most specific (longest) path wins. If they are the same length, Allow wins. Many site owners do not realize this.
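The precedence rule can be made concrete with a small sketch (prefix matching only, no wildcards; the function and rule list are illustrative):

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Google-style precedence: the most specific (longest) matching
    rule wins; on a length tie, Allow wins. If nothing matches,
    crawling is allowed."""
    best_len, best_kind = -1, "allow"
    for kind, rule_path in rules:
        if path.startswith(rule_path):
            if len(rule_path) > best_len or (
                len(rule_path) == best_len and kind == "allow"
            ):
                best_len, best_kind = len(rule_path), kind
    return best_kind == "allow"

rules = [("disallow", "/shop/"), ("allow", "/shop/sale/")]
print(is_allowed("/shop/sale/item", rules))  # True: Allow is longer
print(is_allowed("/shop/cart", rules))       # False: only Disallow matches
```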
Multiple User-agent lines before a single rule set
Some files group multiple user agents before a set of rules. This is valid syntax but can be confusing to read.
Forgetting about subdomains
Rules in example.com/robots.txt do not apply to blog.example.com. Each subdomain needs its own file.
Automating robots.txt Checks
If you manage multiple sites or want to monitor your robots.txt over time, you can automate checks.
With curl and a cron job:
curl -s -o /dev/null -w "%{http_code}" https://yourdomain.com/robots.txt
This returns just the HTTP status code. Set up an alert if it returns anything other than 200.
With a monitoring script:
# Fetch robots.txt and compare it against the last known-good copy
curl -s https://yourdomain.com/robots.txt > /tmp/robots-current.txt
diff /tmp/robots-previous.txt /tmp/robots-current.txt
# After reviewing any differences, promote the current copy to the new baseline
mv /tmp/robots-current.txt /tmp/robots-previous.txt
Compare the current file against a known-good version. If anything has changed, investigate before it affects your search traffic.
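When monitoring many sites, storing a fingerprint instead of a full copy of each file keeps the comparison cheap (a sketch; fetching the live file is left to curl or an HTTP library):

```python
import hashlib

def fingerprint(body: str) -> str:
    """Stable fingerprint of a robots.txt body. Store the value for a
    known-good file and alert when a fresh fetch hashes differently."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

baseline = fingerprint("User-agent: *\nDisallow: /drafts/\n")
current = fingerprint("User-agent: *\nDisallow: /\n")
print(current != baseline)  # True: the file changed, so investigate
```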
With a testing tool: Some robots.txt validators offer monitoring features that periodically check your file and alert you to changes or issues. This is the lowest-maintenance option for ongoing monitoring.