How to Create a robots.txt File
Step-by-step guide to creating a robots.txt file. Learn the syntax, write directives for search engines, and deploy your file correctly.
A robots.txt file tells search engine crawlers which pages they can and cannot access on your site. Every public website should have one. Without it, crawlers will attempt to access everything, and you lose control over what gets crawled.
This guide walks you through creating a robots.txt file from scratch, writing the directives you need, and deploying it correctly.
What Is robots.txt and Why You Need It
The robots.txt file is a plain text file that lives at the root of your domain. When a search engine crawler visits your site, it checks https://yourdomain.com/robots.txt before crawling anything else. The file follows the Robots Exclusion Protocol, a standard that has been around since 1994 and was formalized as RFC 9309 in 2022.
You need a robots.txt file to:
- Prevent crawlers from indexing admin pages, staging environments, or duplicate content
- Reduce server load by blocking crawlers from hitting resource-heavy pages
- Point crawlers to your sitemap
- Control which bots can access which parts of your site
The Four Core Directives
Every robots.txt file is built from a small set of directives. Here are the ones you will use most.
User-agent identifies which crawler the rules apply to. Use * to target all crawlers, or specify a bot by name like Googlebot.
Disallow tells a crawler not to access a specific path. Disallow: /admin/ blocks everything under /admin/.
Allow overrides a Disallow rule for a more specific path. This is useful when you block a directory but want to permit a subdirectory within it.
Sitemap tells crawlers where to find your XML sitemap. This directive is independent of any User-agent block.
Basic Syntax Rules
Before you start writing, know these rules:
- Each directive goes on its own line
- A User-agent line starts a new block of rules
- Rules within a block apply only to the specified User-agent
- Lines starting with # are comments
- The file must be plain text (UTF-8 encoded)
- The file must be named exactly robots.txt (lowercase)
- It must be served from the root of your domain
Creating a Minimal robots.txt
Here is the simplest useful robots.txt file. It allows all crawlers to access everything and points them to your sitemap:
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
If you want to block a specific directory, add a Disallow rule:
User-agent: *
Disallow: /admin/
Disallow: /private/
Sitemap: https://yourdomain.com/sitemap.xml
That blocks all crawlers from /admin/ and /private/ while allowing access to everything else.
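If you want to sanity-check rules like these programmatically, Python's standard library ships urllib.robotparser, which implements basic Allow/Disallow prefix matching (note it does not support the * and $ wildcards covered later). A minimal sketch against the example file above:

```python
from urllib.robotparser import RobotFileParser

# The example file from above (the Sitemap line is optional here).
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Paths under the disallowed directories are blocked...
print(rp.can_fetch("*", "https://yourdomain.com/admin/settings"))  # False
# ...while everything else stays crawlable.
print(rp.can_fetch("*", "https://yourdomain.com/blog/post"))       # True
```

The same parser can be pointed at a live file with set_url() and read() once you deploy.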
Writing a Comprehensive robots.txt
Real-world robots.txt files often need more nuance. Here is a comprehensive example:
# Allow all search engines to crawl the public site
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /tmp/
Disallow: /search?
Disallow: /*.json$
# Give Googlebot access to everything except admin
User-agent: Googlebot
Allow: /
Disallow: /admin/
# Block AI training crawlers entirely
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
# Sitemap location
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-blog.xml
A few things to note in this example:
- More specific User-agent blocks override the wildcard. Googlebot follows its own block, not the * block.
- You can list multiple Sitemap directives. Crawlers will check all of them.
- Wildcards work in paths. /*.json$ matches any URL ending in .json. The * matches any sequence of characters, and $ anchors to the end of the URL.
- Query strings can be blocked. Disallow: /search? blocks URLs like /search?q=test.
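The User-agent specificity rule is easy to verify with urllib.robotparser. The sketch below uses a simplified subset of the example above (the parser treats * and $ in paths literally, so the wildcard lines are omitted):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# GPTBot matches its own block and is shut out of the whole site.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog/"))    # False
# Any other crawler falls back to the * block.
print(rp.can_fetch("SomeBot", "https://yourdomain.com/api/v1"))  # False
print(rp.can_fetch("SomeBot", "https://yourdomain.com/blog/"))   # True
```

Note that a crawler with its own block ignores the * block entirely; it does not merge the two sets of rules.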
Using Wildcards Effectively
The robots.txt standard supports two wildcard characters:
- * matches any sequence of characters
- $ matches the end of the URL
Here are practical examples:
# Block all PDF files
User-agent: *
Disallow: /*.pdf$
# Block all URLs containing "print"
User-agent: *
Disallow: /*print*
# Block all URLs with query parameters
User-agent: *
Disallow: /*?*
Use wildcards carefully. An overly broad pattern can accidentally block pages you want indexed.
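As a rough illustration of how these patterns match, here is a sketch that translates a robots.txt path pattern into a regular expression. This is a simplification (real crawlers also normalize percent-encoding, among other details), and the function name is an invention for this example:

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex (sketch)."""
    # A trailing $ anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the robots * back into .*
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf")))      # True
print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf?v=2")))  # False
print(bool(robots_pattern_to_regex("/*print*").match("/page/print/view")))      # True
```

The second case shows why $ matters: without the anchor, /*.pdf would also catch PDF URLs with query strings appended.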
Step-by-Step Deployment
Create the file
Open a text editor and create a new file. Name it exactly robots.txt. Do not use .txt.txt or Robots.txt -- the name is case-sensitive on most servers.
Write your directives
Start with a User-agent: * block and add the Disallow rules you need. Add your Sitemap directive at the bottom.
Upload to your domain root
Upload the file to the root directory of your web server. It must be accessible at https://yourdomain.com/robots.txt. For most hosting setups, this means placing it in the public_html, www, or public folder.
Verify it is accessible
Open your browser and navigate to https://yourdomain.com/robots.txt. You should see the plain text content of your file. If you get a 404 error, the file is not in the right location.
Test your rules
Use a robots.txt testing tool to verify that your rules work as intended. Paste your file content and test specific URLs against it to make sure the right pages are allowed and blocked.
Common Mistakes to Avoid
Watch out for these pitfalls
- Blocking your entire site with Disallow: / under User-agent: *. This tells all crawlers to stay away from everything.
- Blocking CSS and JavaScript files. Google needs these to render your pages. If you block them, your pages may not be indexed correctly.
- Using relative Sitemap URLs. The Sitemap directive requires a full absolute URL including https://.
- Forgetting the trailing slash on directories. Disallow: /admin is a prefix match: it blocks any path starting with /admin, including /administrator. Disallow: /admin/ only blocks paths under the /admin/ directory. Be intentional about which you mean.
- Assuming robots.txt blocks indexing. It blocks crawling, not indexing. If other sites link to a page you have disallowed, Google may still index the URL (without content). Use noindex meta tags if you need to prevent indexing entirely.
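The trailing-slash pitfall is easy to demonstrate with urllib.robotparser, which uses the same prefix matching:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin"])

# /admin is a prefix match: it also catches /administrator.
print(rp.can_fetch("*", "https://yourdomain.com/administrator"))   # False

rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /admin/"])

# /admin/ only matches paths under that directory.
print(rp2.can_fetch("*", "https://yourdomain.com/administrator"))  # True
print(rp2.can_fetch("*", "https://yourdomain.com/admin/users"))    # False
```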
Testing Your robots.txt
After creating your file, always test it. A single typo can accidentally block your entire site from search engines.
You can test by:
- Fetching https://yourdomain.com/robots.txt in your browser to verify it is live
- Using Google Search Console's URL Inspection tool to check specific URLs
- Using a dedicated robots.txt testing tool to validate syntax and test URL matching
- Checking your server logs to see if crawlers are respecting your rules
The fastest method is using a purpose-built validator that parses your file and lets you test URLs against your rules instantly.
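One way to automate this before deploying is a small smoke test: parse the file you are about to upload and assert that your important pages are still crawlable. A sketch, where the file content and the URL list are placeholders for your own:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical: the robots.txt you are about to deploy, plus pages
# that must never be blocked.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://yourdomain.com/sitemap.xml
"""
MUST_BE_CRAWLABLE = [
    "https://yourdomain.com/",
    "https://yourdomain.com/blog/",
]

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for url in MUST_BE_CRAWLABLE:
    assert rp.can_fetch("*", url), f"robots.txt blocks {url}"

# site_maps() (Python 3.8+) returns the Sitemap directives, or None if absent.
print(rp.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```

Run as part of your deploy pipeline, a check like this catches the classic typo that blocks the whole site.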
Your robots.txt is the first thing crawlers see. Make it count.