How to Block AI Crawlers with robots.txt
Block AI training crawlers like GPTBot, ClaudeBot, and PerplexityBot using robots.txt. Complete list of AI user agents and copy-paste rules.
AI companies are crawling the web to train their models, and your content is fair game unless you explicitly block them. Unlike search engine crawlers that index your pages so users can find you, AI training crawlers take your content to build commercial products. You get nothing in return -- no traffic, no attribution, no compensation.
If you want to control which AI systems can access your content, robots.txt is your first line of defense. This guide gives you the complete list of AI crawler user agents, copy-paste rules to block them, and the nuance you need to make informed decisions.
Search Crawlers vs. AI Training Crawlers
Not all crawlers are created equal. It is important to understand the difference before you start blocking.
Search engine crawlers (Googlebot, Bingbot) index your content so it appears in search results. Blocking them means losing organic traffic. You almost always want to allow these.
AI training crawlers scrape your content to train large language models and other AI systems. Blocking them does not affect your search rankings or traffic. Your content simply will not be used to train those models.
AI-powered search crawlers are a newer category. These crawlers fetch your content to generate AI-powered search answers (like Perplexity or Google's AI Overviews). Blocking them means your content will not be cited in AI search results, but your regular search rankings are unaffected.
Some companies use separate user agents for search and training. Google, for example, uses Googlebot for search indexing and the Google-Extended token to let you opt out of AI training. Others bundle everything under one user agent.
robots.txt is voluntary
The Robots Exclusion Protocol is a gentleman's agreement. Major AI companies have committed to respecting robots.txt, but there is no technical enforcement. A bad actor can ignore it entirely. For critical content protection, you may need additional measures like rate limiting, authentication, or legal action.
Complete List of AI Crawler User Agents
Here is the most comprehensive list of known AI crawler user agents as of early 2026. New crawlers appear regularly, so check back periodically.
OpenAI
| User Agent | Purpose |
|---|---|
| GPTBot | Training data collection for GPT models |
| ChatGPT-User | Real-time web browsing when users ask ChatGPT to search the web |
| OAI-SearchBot | OpenAI's search feature crawler |
Anthropic
| User Agent | Purpose |
|---|---|
| ClaudeBot | Training data collection for Claude models |
| anthropic-ai | Older Anthropic crawler identifier |
| Claude-Web | Claude's web browsing feature |
Google
| User Agent | Purpose |
|---|---|
| Google-Extended | Controls use of your content for Gemini AI training (a robots.txt token honored by Googlebot, not a separate crawler) |
| Googlebot | Standard search indexing (you probably want to keep this allowed) |
Meta
| User Agent | Purpose |
|---|---|
| FacebookBot | Content preview and potentially AI training |
| meta-externalagent | Meta's AI training data crawler |
| Meta-ExternalFetcher | Meta's content fetching agent |
Other AI Companies
| User Agent | Purpose |
|---|---|
| PerplexityBot | Perplexity AI search and training |
| Bytespider | ByteDance (TikTok parent) crawler, used for various AI purposes |
| CCBot | Common Crawl bot, data used by many AI training pipelines |
| Diffbot | Web scraping service used for AI training datasets |
| Applebot-Extended | Apple's opt-out token for AI training (controls use of content crawled by the regular Applebot) |
| cohere-ai | Cohere AI's training data crawler |
| Amazonbot | Amazon's crawler, used for Alexa and AI features |
| YouBot | You.com AI search crawler |
| Scrapy | Open-source scraping framework (generic, not specific to one company) |
| ImagesiftBot | Image scraping for AI training |
| Omgilibot | Data mining crawler |
| Timpibot | Timpi search engine crawler |
| PetalBot | Huawei's search and AI crawler |
| Seekr | Seekr AI crawler |
Copy-Paste Rules: Block All AI Crawlers
Add these rules to your robots.txt to block all known AI training crawlers. These rules do not affect your search engine indexing.
```
# Block AI training crawlers
# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
# Google AI (does not affect regular Google Search)
User-agent: Google-Extended
Disallow: /
# Meta
User-agent: FacebookBot
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
# Perplexity
User-agent: PerplexityBot
Disallow: /
# ByteDance
User-agent: Bytespider
Disallow: /
# Common Crawl
User-agent: CCBot
Disallow: /
# Apple AI
User-agent: Applebot-Extended
Disallow: /
# Cohere
User-agent: cohere-ai
Disallow: /
# Amazon
User-agent: Amazonbot
Disallow: /
# Diffbot
User-agent: Diffbot
Disallow: /
# You.com
User-agent: YouBot
Disallow: /
# Other AI crawlers
User-agent: ImagesiftBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: Seekr
Disallow: /
User-agent: Timpibot
Disallow: /
```
Copy this entire block and add it to your robots.txt file before the Sitemap directive.
Block Some, Allow Others
You may not want to block every AI crawler. Perhaps you want your content to appear in Perplexity search results but not be used for model training. Or you are fine with Google's AI features but want to block other companies.
Block training crawlers only, allow AI search:
```
# Block training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
# Allow AI search crawlers (for citation in AI-powered search)
# ChatGPT-User - browsing feature
# PerplexityBot - Perplexity search
# OAI-SearchBot - OpenAI search
# (No Disallow rules for these, so they default to allowed)
```
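You can sanity-check a split policy like this with Python's standard-library robots.txt parser before deploying it. The snippet below is a minimal sketch using a subset of the rules above; the example URL is illustrative:

```python
# Sketch: verify that a "block training, allow AI search" policy behaves
# as intended, using Python's stdlib urllib.robotparser.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/some-article"
# Training crawlers named in a group are blocked...
assert not parser.can_fetch("GPTBot", url)
assert not parser.can_fetch("ClaudeBot", url)
# ...while agents with no matching group (and no "*" group) default to allowed.
assert parser.can_fetch("PerplexityBot", url)
assert parser.can_fetch("Googlebot", url)
print("policy behaves as intended")
```

Running this against your real file (via `RobotFileParser.set_url` and `read`) gives the same checks on the live deployment.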
Allow only Google's AI features:
```
# Block all AI crawlers except Google
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
# Google-Extended is NOT blocked, allowing Google AI features
# Googlebot is NOT blocked, allowing regular search indexing
```
Think about your goals
Blocking AI crawlers is not all-or-nothing. Consider what you want: Do you want your content in AI search results? Do you mind if your content trains models? Different answers lead to different configurations.
Blocking AI Crawlers from Specific Sections
You might want AI crawlers to access some parts of your site but not others. For example, you could allow your blog posts to be used as citations in AI search while blocking your premium content:
```
User-agent: GPTBot
Allow: /blog/
Disallow: /

User-agent: PerplexityBot
Allow: /blog/
Disallow: /

User-agent: ClaudeBot
Disallow: /
```
This allows GPTBot and PerplexityBot to access your blog but blocks them from everything else. ClaudeBot is blocked entirely.
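Path-level rules are easy to get subtly wrong, so it is worth testing them. A sketch with Python's stdlib parser (note one caveat: Python applies rules in file order, first match wins, while Google uses longest-path matching; for this layout, with `Allow` listed before `Disallow`, both interpretations agree):

```python
# Sketch: check per-section Allow/Disallow rules for an AI crawler.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot may read the blog but nothing else; ClaudeBot is blocked entirely.
assert parser.can_fetch("GPTBot", "https://example.com/blog/my-post")
assert not parser.can_fetch("GPTBot", "https://example.com/premium/report")
assert not parser.can_fetch("ClaudeBot", "https://example.com/blog/my-post")
```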
Platform-Specific Instructions
WordPress
If you use Yoast SEO or Rank Math, open the robots.txt editor in the plugin settings and add the AI crawler blocks. If you have a physical robots.txt file, edit it directly. See our WordPress robots.txt guide for detailed instructions.
Shopify
Create a robots.txt.liquid template in your theme and add the AI crawler blocks. Your custom rules will be appended to Shopify's defaults. See our Shopify robots.txt guide for step-by-step instructions.
Next.js / Static Sites
Add the rules directly to your robots.txt file in your public directory. If you generate robots.txt dynamically, add the AI crawler blocks to your generation logic.
Nginx / Apache
If you serve robots.txt through your web server configuration, add the AI crawler blocks to the response. Make sure the complete file is served as text/plain.
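As a minimal sketch, an nginx location that serves the file with the correct MIME type (file path is illustrative, adjust for your deployment):

```nginx
# Serve robots.txt as text/plain from a fixed path
location = /robots.txt {
    alias /var/www/site/robots.txt;
    default_type text/plain;
}
```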
Verifying Your Rules Are Working
After adding AI crawler blocks, you need to verify they are actually in effect.
Check the live file
Open https://yourdomain.com/robots.txt in your browser. Verify that all the AI crawler User-agent blocks are present with Disallow: /.
Test with a validator
Use a robots.txt testing tool. Enter a URL from your site and select each AI crawler user agent. The tool should report "Blocked" for each one.
Check server logs
If you have access to server logs, search for the AI crawler user agent strings. A compliant crawler should fetch /robots.txt and then make no further requests for your pages. If you still see AI crawlers hitting your content, they may not be respecting your rules.
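A quick way to do this check is to tally AI-crawler requests per user agent, ignoring the expected /robots.txt fetches. A sketch (the log format and sample lines are illustrative; adjust the agent list and parsing for your server):

```python
# Sketch: count non-robots.txt requests per AI crawler in an access log.
from collections import Counter

AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider")

def crawler_hits(log_lines):
    """Count page requests per AI crawler, skipping robots.txt fetches."""
    hits = Counter()
    for line in log_lines:
        if "/robots.txt" in line:
            continue  # fetching robots.txt itself is expected and fine
        for agent in AI_AGENTS:
            if agent in line:
                hits[agent] += 1
    return hits

sample = [
    '1.2.3.4 - - [10/Jan/2026] "GET /robots.txt HTTP/1.1" 200 "GPTBot/1.0"',
    '1.2.3.4 - - [10/Jan/2026] "GET /blog/post HTTP/1.1" 200 "GPTBot/1.0"',
]
print(crawler_hits(sample))  # any nonzero count means pages were still fetched
```

In practice you would feed it `open("/var/log/nginx/access.log")` instead of the sample list.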
Monitor over time
AI companies release new crawlers and change user agent strings. Check this list periodically and update your robots.txt when new crawlers appear.
Beyond robots.txt
While robots.txt is the standard approach, there are additional layers of protection you can consider.
Meta tags: Add <meta name="robots" content="noai, noimageai"> to your HTML pages. This is a newer convention that some AI companies respect.
HTTP headers: Use the X-Robots-Tag header to set directives at the server level, especially useful for non-HTML resources like PDFs and images.
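A sketch in nginx; note that `noai` and `noimageai` are the same non-standard values as the meta tag above, so support varies by crawler:

```nginx
# Best-effort: attach AI directives to non-HTML assets via a response header
location ~* \.(pdf|png|jpe?g|gif)$ {
    add_header X-Robots-Tag "noai, noimageai" always;
}
```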
AI.txt: Some sites use an ai.txt file (similar to robots.txt) to express AI-specific preferences. This is not yet a standard, but some companies check for it.
Rate limiting: Configure your web server to rate-limit requests from known AI crawler IP ranges. This provides technical enforcement, not just a polite request.
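One hedged sketch in nginx (zone name and rate are illustrative): matching on user agent is simpler than maintaining IP-range lists, though a bad actor can spoof the user agent, so treat it as a first layer rather than real enforcement:

```nginx
# Throttle requests whose User-Agent matches known AI crawlers.
# An empty map value means no rate limit is applied to that request.
map $http_user_agent $ai_crawler_key {
    default "";
    ~*(gptbot|claudebot|bytespider|ccbot) $binary_remote_addr;
}

limit_req_zone $ai_crawler_key zone=aibots:10m rate=30r/m;

server {
    listen 80;
    server_name example.com;

    location / {
        limit_req zone=aibots burst=10 nodelay;
        root /var/www/site;
    }
}
```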
Legal measures: Include clear terms of service stating that automated scraping for AI training is prohibited. This gives you legal recourse if bots ignore your technical controls.
The landscape is changing fast
AI crawling standards are evolving rapidly. New crawlers, new conventions, and new legal frameworks are appearing regularly. What works today may need updating in six months. Bookmark this guide and check back for updates.
Frequently Asked Questions
Will blocking AI crawlers affect my Google search rankings?
No. Blocking GPTBot, ClaudeBot, Google-Extended, and other AI crawlers does not affect your Googlebot search indexing. These are separate user agents with separate rules.
Do all AI companies respect robots.txt?
The major companies (OpenAI, Anthropic, Google, Meta, Apple) have publicly committed to respecting robots.txt. Smaller or less scrupulous crawlers may not. There is no technical enforcement mechanism built into robots.txt.
Should I block ChatGPT-User?
It depends. ChatGPT-User is the agent that fetches pages in real time when a ChatGPT user asks it to browse the web. Blocking it means your content will not appear when users ask ChatGPT to look something up. This is different from GPTBot, which collects training data.
How often should I update my AI crawler list?
Check quarterly at minimum. New AI products and crawlers are launching frequently. When a major AI product launches, check what user agent it uses and add a block if needed.