robots.txt Testing for Content Publishers

Control which crawlers access your content. Test your robots.txt to manage search engine access and block AI training crawlers.

You publish an investigative piece that took your team three months to report. It gets picked up across the web, drives significant traffic, and wins industry recognition. Six months later, you discover that the article -- along with thousands of others from your archive -- has been scraped and used to train a large language model. Your content is now being regurgitated in AI chatbot responses without attribution, without compensation, and without your permission.

You want to block AI training crawlers. But you also depend on Google and Bing for the majority of your traffic. You need search engines to find your content. The question is how to let the right crawlers in while keeping the wrong ones out.

The answer starts with your robots.txt file.

The publisher's dilemma

Publishers face a tension that did not exist five years ago. Search engine crawlers and AI training crawlers both want access to your content, but for fundamentally different purposes.

Search engine crawlers                  | AI training crawlers
Index your content and send you traffic | Scrape your content to train models
Drive readers to your site              | May reduce the need for readers to visit your site
Respect your paywall signals            | May scrape content regardless of paywalls
You benefit from being crawled          | The benefit flows primarily to the AI company
Blocking them kills your traffic        | Blocking them protects your content

You cannot use a single Disallow: / to solve this. That would block everything, including the search engines that drive your audience to your site. You need selective blocking -- allowing Googlebot and Bingbot while denying GPTBot, ClaudeBot, and other AI crawlers.

How robots.txt controls different crawler types

Every web crawler identifies itself with a User-agent string. Your robots.txt file can target rules to specific crawlers by name. This means you can write one set of rules for search engines and a completely different set for AI training bots.

# Allow search engines full access
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

# Default: allow everything else
User-agent: *
Allow: /

Sitemap: https://yourpublication.com/sitemap.xml

This configuration lets search engines crawl your entire site while blocking major AI training crawlers from accessing any of your content.
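You can sanity-check a configuration like this before deploying it. Here is a minimal sketch using Python's standard-library `urllib.robotparser`, with a trimmed version of the file above inlined as a string (the article path is a made-up example):

```python
# Sketch: verify selective robots.txt rules with Python's stdlib parser.
# The robots.txt content is inlined here for illustration; in practice
# you would load your real file.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Search engine in, AI training crawler out:
print(rp.can_fetch("Googlebot", "/articles/investigation"))  # True
print(rp.can_fetch("GPTBot", "/articles/investigation"))     # False
```

The parser applies the most specific matching `User-agent` group, which is exactly the behavior the selective-blocking pattern relies on.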

robots.txt is advisory, not enforcement

Well-behaved crawlers from major companies respect robots.txt rules. OpenAI's GPTBot, Anthropic's ClaudeBot, and Google's Google-Extended all honor these directives. However, robots.txt cannot stop bad actors or scrapers that ignore the protocol entirely. For stronger protection, consider additional measures like rate limiting and access controls.
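Because robots.txt is only advisory, actual enforcement has to happen at the server. A minimal sketch of the idea, as a hypothetical server-side check on the incoming `User-Agent` header (the blocklist and function name are illustrative, not part of any specific framework):

```python
# Sketch: server-side enforcement of crawler blocks, independent of robots.txt.
# A request whose User-Agent matches a blocked bot would be refused
# (e.g. with HTTP 403) before any content is served.
BLOCKED_AGENT_SUBSTRINGS = ["GPTBot", "ClaudeBot", "CCBot"]

def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent string matches a blocked crawler."""
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in BLOCKED_AGENT_SUBSTRINGS)

print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))          # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))    # False
```

Note the limitation: a scraper that spoofs its User-Agent string slips past this check too, which is why rate limiting and access controls remain worth layering on top.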

The AI crawlers you should know about

The landscape of AI crawlers is growing. Here are the major ones publishers should consider blocking or allowing based on their content strategy.

GPTBot (OpenAI)

OpenAI's crawler for gathering training data. User-agent: GPTBot. Blocking this prevents your content from being used to train future GPT models. OpenAI states that it respects robots.txt.

ClaudeBot (Anthropic)

Anthropic's web crawler. User-agent: ClaudeBot. Used for training Claude models. Anthropic respects robots.txt directives.

Google-Extended

Google's AI-specific crawler, separate from Googlebot. User-agent: Google-Extended. Blocking this prevents your content from being used for Google's AI products (like Gemini) while still allowing regular Google Search indexing via Googlebot.

PerplexityBot

Perplexity AI's crawler. User-agent: PerplexityBot. Used to power Perplexity's AI search engine. Blocking this prevents your content from being cited in Perplexity answers.

CCBot (Common Crawl)

The Common Crawl project's crawler. User-agent: CCBot. Common Crawl data is widely used by AI companies for training. Blocking CCBot reduces the likelihood of your content appearing in Common Crawl datasets.
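Since this list changes over time, it can help to keep the crawler names in one place and generate the blocking section from it. A small sketch, assuming the five crawlers named above:

```python
# Sketch: generate the AI-crawler section of a robots.txt from a list,
# so adding a newly announced crawler is a one-line change.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "CCBot"]

def ai_block_rules(crawlers):
    """Return a robots.txt fragment that blocks each named crawler entirely."""
    blocks = [f"User-agent: {name}\nDisallow: /" for name in crawlers]
    return "\n\n".join(blocks)

print(ai_block_rules(AI_CRAWLERS))
```

Paste the output above your `User-agent: *` section and your search-engine rules stay untouched.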

Check which crawlers can access your content

Test your robots.txt against specific AI crawler user agents. Verify that your blocking rules work correctly.

Testing that your rules actually work

Writing the rules is only half the job. You need to verify that they do what you intend. A typo in a User-agent name means the rule does nothing. A misplaced directive can block the wrong crawler.

1. Paste your robots.txt into the tester

Start with your current live robots.txt. Paste it into Robots.txt Tester and review the parsed output. Confirm that every User-agent block you expect is present and correctly formatted.

2. Test as Googlebot

Enter your key article URLs and test them as Googlebot. Every article, category page, and homepage should show as "Allowed." If any are blocked, your search traffic is at risk.

3. Test as each AI crawler

Switch the User-agent to GPTBot, then ClaudeBot, then Google-Extended. Test the same URLs. Every one should show as "Blocked." If any are allowed, your blocking rules have a gap.

4. Test edge cases

Check your RSS feeds, API endpoints, and AMP pages if you have them. Decide whether AI crawlers should access these alternative representations of your content and verify the rules match your intent.

5. Retest after every change

Any time you update your robots.txt -- adding a new AI crawler, adjusting rules for a new section, or restructuring your site -- run through these tests again. Rules that worked yesterday might not cover today's changes.
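The retest loop above can also be scripted, so every robots.txt change is checked against the same expectations. A sketch with Python's stdlib parser; the paths and the trimmed config are examples, not your real file:

```python
# Sketch: a repeatable test matrix for robots.txt changes.
# Each row states which crawler should (or should not) reach which path.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# (crawler, path, should_be_allowed) -- adjust to your own key URLs
EXPECTATIONS = [
    ("Googlebot", "/articles/big-story", True),
    ("GPTBot", "/articles/big-story", False),
    ("GPTBot", "/feed.xml", False),
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

for agent, path, expected in EXPECTATIONS:
    actual = rp.can_fetch(agent, path)
    status = "OK" if actual == expected else "GAP"
    print(f"{status}: {agent} {path} allowed={actual}")
```

Run this after every edit; any `GAP` line means a rule no longer does what you intended.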

Selective blocking strategies for publishers

Not every publisher wants to block all AI crawlers completely. Some want nuanced control. Here are common strategies.

Block all AI training, allow search. The most common approach. Block GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and CCBot. Allow Googlebot, Bingbot, and other search crawlers. This maximizes search visibility while denying AI training access.

Block AI training but allow AI search. Some publishers want to appear in AI-powered search results (like Perplexity or Google AI Overviews) but do not want their content used for model training. This requires understanding which crawlers serve which purpose -- Google-Extended is for AI training, while Googlebot handles both search and AI search features.

Block everything behind the paywall. If you have a metered or hard paywall, you might allow AI crawlers to access free content while blocking them from premium articles. Use path-based rules to protect specific directories.

Time-based approach. Some publishers block AI crawlers from recent content (the first 30 days) and allow access to archived material. While robots.txt itself does not support time-based rules, you can implement this by moving content between blocked and allowed paths as it ages.
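The paywall strategy in particular comes down to path-based rules, which are easy to get subtly wrong. A quick sketch of checking them, assuming a hypothetical `/premium/` directory for paid content:

```python
# Sketch: path-based rules for the paywall strategy -- premium content
# blocked for an AI crawler, free content left open.
# The /premium/ directory name is a made-up example.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /premium/

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "/free/sample-article"))    # True
print(rp.can_fetch("GPTBot", "/premium/investigation"))  # False
```

One design note: `Disallow: /premium/` matches by path prefix, so it protects everything under that directory but nothing elsewhere, which is what makes the free tier stay crawlable.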

New AI crawlers appear regularly

The list of AI crawlers is growing. What blocks AI access today may not cover new crawlers that emerge next month. Revisit your robots.txt periodically and check industry resources for newly announced crawler User-agent strings.

Verify your AI crawler blocking rules

Test your robots.txt against every major AI crawler user agent. Confirm your content is protected.

Pricing

Robots.txt Tester is free. Test your publisher robots.txt against any crawler user agent, as often as you need.

Part of Boring Tools -- boring tools for boring jobs.

Test your robots.txt for free

Validate your robots.txt file instantly. Check directives, find crawling issues, and ensure search engines can access your site.