How to Create a robots.txt File
Step-by-step guide to creating a robots.txt file. Learn the syntax, write directives for search engines, and deploy your file correctly.
A robots.txt file tells search engine crawlers which pages they can and cannot access on your site. Every public website should have one. Without it, crawlers will attempt to access everything, and you lose control over what gets crawled.
This guide walks you through creating a robots.txt file from scratch, writing the directives you need, and deploying it correctly.
What Is robots.txt and Why You Need It
The robots.txt file is a plain text file that lives at the root of your domain. When a search engine crawler visits your site, it checks https://yourdomain.com/robots.txt before crawling anything else. The file follows the Robots Exclusion Protocol, a standard that has been around since 1994 and was formalized as RFC 9309 in 2022.
You need a robots.txt file to:
- Prevent crawlers from indexing admin pages, staging environments, or duplicate content
- Reduce server load by blocking crawlers from hitting resource-heavy pages
- Point crawlers to your sitemap
- Control which bots can access which parts of your site
The Four Core Directives
Every robots.txt file is built from a small set of directives. Here are the ones you will use most.
User-agent identifies which crawler the rules apply to. Use * to target all crawlers, or specify a bot by name like Googlebot.
Disallow tells a crawler not to access a specific path. Disallow: /admin/ blocks everything under /admin/.
Allow overrides a Disallow rule for a more specific path. This is useful when you block a directory but want to permit a subdirectory within it.
Sitemap tells crawlers where to find your XML sitemap. This directive is independent of any User-agent block.
Basic Syntax Rules
Before you start writing, know these rules:
- Each directive goes on its own line
- A User-agent line starts a new block of rules
- Rules within a block apply only to the specified User-agent
- Lines starting with # are comments
- The file must be plain text (UTF-8 encoded)
- The file must be named exactly robots.txt (lowercase)
- It must be served from the root of your domain
Creating a Minimal robots.txt
Here is the simplest useful robots.txt file. It allows all crawlers to access everything and points them to your sitemap:
User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
If you want to block a specific directory, add a Disallow rule:
User-agent: *
Disallow: /admin/
Disallow: /private/
Sitemap: https://yourdomain.com/sitemap.xml
That blocks all crawlers from /admin/ and /private/ while allowing access to everything else.
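If you want to sanity-check rules like these programmatically, Python's standard library ships urllib.robotparser, which implements basic Allow/Disallow prefix matching (note it does not support the * and $ wildcards covered later). A minimal sketch against the example file above:

```python
from urllib.robotparser import RobotFileParser

# The example file from above (the Sitemap line is optional here).
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Paths under the disallowed directories are blocked...
print(rp.can_fetch("*", "https://yourdomain.com/admin/settings"))  # False
# ...while everything else stays crawlable.
print(rp.can_fetch("*", "https://yourdomain.com/blog/post"))       # True
```

The same parser can be pointed at a live file with set_url() and read() once you deploy.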
Writing a Comprehensive robots.txt
Real-world robots.txt files often need more nuance. Here is a comprehensive example:
# Allow all search engines to crawl the public site
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /tmp/
Disallow: /search?
Disallow: /*.json$
# Give Googlebot access to everything except admin
User-agent: Googlebot
Allow: /
Disallow: /admin/
# Block AI training crawlers entirely
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
# Sitemap location
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-blog.xml
A few things to note in this example:
- More specific User-agent blocks override the wildcard. Googlebot follows its own block, not the * block.
- You can list multiple Sitemap directives. Crawlers will check all of them.
- Wildcards work in paths. /*.json$ matches any URL ending in .json. The * matches any sequence of characters, and $ anchors to the end of the URL.
- Query strings can be blocked. Disallow: /search? blocks URLs like /search?q=test.
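The User-agent specificity rule is easy to verify with urllib.robotparser. The sketch below uses a simplified subset of the example above (the parser treats * and $ in paths literally, so the wildcard lines are omitted):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# GPTBot matches its own block and is shut out of the whole site.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog/"))    # False
# Any other crawler falls back to the * block.
print(rp.can_fetch("SomeBot", "https://yourdomain.com/api/v1"))  # False
print(rp.can_fetch("SomeBot", "https://yourdomain.com/blog/"))   # True
```

Note that a crawler with its own block ignores the * block entirely; it does not merge the two sets of rules.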
Using Wildcards Effectively
The robots.txt standard supports two wildcard characters:
- * matches any sequence of characters
- $ matches the end of the URL
Here are practical examples:
# Block all PDF files
User-agent: *
Disallow: /*.pdf$
# Block all URLs containing "print"
User-agent: *
Disallow: /*print*
# Block all URLs with query parameters
User-agent: *
Disallow: /*?*
Use wildcards carefully. An overly broad pattern can accidentally block pages you want indexed.
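As a rough illustration of how these patterns match, here is a sketch that translates a robots.txt path pattern into a regular expression. This is a simplification (real crawlers also normalize percent-encoding, among other details), and the function name is an invention for this example:

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex (sketch)."""
    # A trailing $ anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the robots * back into .*
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf")))      # True
print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf?v=2")))  # False
print(bool(robots_pattern_to_regex("/*print*").match("/page/print/view")))      # True
```

The second case shows why $ matters: without the anchor, /*.pdf would also catch PDF URLs with query strings appended.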
Step-by-Step Deployment
Create the file
Open a text editor and create a new file. Name it exactly robots.txt. Do not use .txt.txt or Robots.txt -- the name is case-sensitive on most servers.
Write your directives
Start with a User-agent: * block and add the Disallow rules you need. Add your Sitemap directive at the bottom.
Upload to your domain root
Upload the file to the root directory of your web server. It must be accessible at https://yourdomain.com/robots.txt. For most hosting setups, this means placing it in the public_html, www, or public folder.
Verify it is accessible
Open your browser and navigate to https://yourdomain.com/robots.txt. You should see the plain text content of your file. If you get a 404 error, the file is not in the right location.
Test your rules
Use a robots.txt testing tool to verify that your rules work as intended. Paste your file content and test specific URLs against it to make sure the right pages are allowed and blocked.
Common Mistakes to Avoid
Watch out for these pitfalls
- Blocking your entire site with Disallow: / under User-agent: *. This tells all crawlers to stay away from everything.
- Blocking CSS and JavaScript files. Google needs these to render your pages. If you block them, your pages may not be indexed correctly.
- Using relative Sitemap URLs. The Sitemap directive requires a full absolute URL including https://.
- Forgetting the trailing slash on directories. Disallow: /admin is a prefix match: it blocks any path starting with /admin, including /administrator. Disallow: /admin/ only blocks paths under the /admin/ directory. Be intentional about which you mean.
- Assuming robots.txt blocks indexing. It blocks crawling, not indexing. If other sites link to a page you have disallowed, Google may still index the URL (without content). Use noindex meta tags if you need to prevent indexing entirely.
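The trailing-slash pitfall is easy to demonstrate with urllib.robotparser, which uses the same prefix matching:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin"])

# /admin is a prefix match: it also catches /administrator.
print(rp.can_fetch("*", "https://yourdomain.com/administrator"))   # False

rp2 = RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /admin/"])

# /admin/ only matches paths under that directory.
print(rp2.can_fetch("*", "https://yourdomain.com/administrator"))  # True
print(rp2.can_fetch("*", "https://yourdomain.com/admin/users"))    # False
```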
Testing Your robots.txt
After creating your file, always test it. A single typo can accidentally block your entire site from search engines.
You can test by:
- Fetching https://yourdomain.com/robots.txt in your browser to verify it is live
- Using Google Search Console's URL Inspection tool to check specific URLs
- Using a dedicated robots.txt testing tool to validate syntax and test URL matching
- Checking your server logs to see if crawlers are respecting your rules
The fastest method is using a purpose-built validator that parses your file and lets you test URLs against your rules instantly.
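One way to automate this before deploying is a small smoke test: parse the file you are about to upload and assert that your important pages are still crawlable. A sketch, where the file content and the URL list are placeholders for your own:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical: the robots.txt you are about to deploy, plus pages
# that must never be blocked.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://yourdomain.com/sitemap.xml
"""
MUST_BE_CRAWLABLE = [
    "https://yourdomain.com/",
    "https://yourdomain.com/blog/",
]

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for url in MUST_BE_CRAWLABLE:
    assert rp.can_fetch("*", url), f"robots.txt blocks {url}"

# site_maps() (Python 3.8+) returns the Sitemap directives, or None if absent.
print(rp.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```

Run as part of your deploy pipeline, a check like this catches the classic typo that blocks the whole site.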
Your robots.txt is the first thing crawlers see. Make it count.