robots.txt User-Agent: How to Target Specific Crawlers
How to use the User-agent directive in robots.txt to create rules for specific search engines, bots, and crawlers.
What the User-agent directive does
The User-agent line in robots.txt specifies which crawler the following rules apply to. Every Disallow and Allow rule belongs to the User-agent block above it. Without a User-agent line, rules have no target.
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /old-content/
User-agent: *
Disallow: /admin/
This gives you granular control: different rules for different crawlers. You can let Google crawl pages that you block Bing from, or block AI training bots while allowing search engine crawlers full access.
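You can sanity-check per-crawler rules like these with Python's standard-library `urllib.robotparser` (an approximation of RFC 9309 behavior; the example.com URLs are placeholders):

```python
import urllib.robotparser

# The example robots.txt above, with one block per crawler.
ROBOTS = """\
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /old-content/

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Each crawler is bound only by the block that names it.
print(rp.can_fetch("Googlebot", "https://example.com/private/"))   # False
print(rp.can_fetch("Bingbot", "https://example.com/private/"))     # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/admin/"))  # False
```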
The wildcard User-agent
The asterisk * is the catch-all. It matches any crawler that doesn't have its own specific block:
User-agent: *
Disallow: /admin/
Disallow: /tmp/
If a crawler shows up that isn't mentioned by name anywhere in your robots.txt, it follows the User-agent: * rules. If there's no wildcard block and no specific block matching the crawler's name, the bot assumes everything is allowed.
Most sites only need the wildcard. You add specific User-agent blocks when you need different behavior for different crawlers.
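The "no match means everything is allowed" fallback can be demonstrated with the standard-library parser (bot names here are illustrative):

```python
import urllib.robotparser

# A robots.txt with one named block and no wildcard fallback.
ROBOTS = """\
User-agent: Googlebot
Disallow: /drafts/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Googlebot follows its block; a crawler matched by no block
# (and with no wildcard present) is allowed everywhere.
print(rp.can_fetch("Googlebot", "https://example.com/drafts/"))   # False
print(rp.can_fetch("MysteryBot", "https://example.com/drafts/"))  # True
```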
Targeting specific crawlers
Here's how to set rules for individual search engines:
# Google's primary web crawler
User-agent: Googlebot
Disallow: /internal-tools/
# Bing's crawler
User-agent: Bingbot
Disallow: /internal-tools/
Crawl-delay: 5
# Block AI training bots entirely
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Everyone else
User-agent: *
Disallow: /admin/
Allow: /
Crawlers find their group by comparing each User-agent value against their own product token. Google's specialized crawlers also fall back to the more general token: Googlebot-News obeys a Googlebot group when no Googlebot-News group exists.
Case sensitivity
Per RFC 9309, User-agent matching is case-insensitive. Googlebot, googlebot, and GOOGLEBOT all match the same crawler. However, using the standard capitalization from the crawler's documentation is recommended for readability.
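The matching rule can be sketched as a small helper (a simplification of RFC 9309 product-token matching; the function name and tokens are illustrative):

```python
def group_applies(group_token: str, crawler_token: str) -> bool:
    """Case-insensitive match of a robots.txt User-agent token
    against a crawler's product token (simplified from RFC 9309)."""
    if group_token == "*":
        return True
    return group_token.lower() == crawler_token.lower()

print(group_applies("GOOGLEBOT", "Googlebot"))  # True
print(group_applies("*", "AnyBot"))             # True
print(group_applies("Bingbot", "Googlebot"))    # False
```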
Common crawler names and what they do
Here's a reference of the crawlers you're most likely to encounter:
| User-agent | What it does |
|---|---|
| Googlebot | Google's primary web crawler for search indexing |
| Googlebot-Image | Google's image-specific crawler |
| Googlebot-News | Google News crawler |
| Google-Extended | Google's AI training data crawler |
| Bingbot | Microsoft Bing's search crawler |
| Yandex | Yandex search engine crawler (Russia) |
| Baiduspider | Baidu search engine crawler (China) |
| DuckDuckBot | DuckDuckGo's crawler |
| Slurp | Yahoo's legacy crawler |
| facebookexternalhit | Facebook's link preview crawler |
| Twitterbot | Twitter/X's link preview crawler |
| GPTBot | OpenAI's crawler for AI training |
| CCBot | Common Crawl's open web crawler |
| Applebot | Apple's crawler for Siri and Spotlight |
| AhrefsBot | Ahrefs SEO tool crawler |
| SemrushBot | Semrush SEO tool crawler |
Google operates several specialized crawlers beyond the main Googlebot. Each can be targeted independently. For example, you might allow Googlebot full access but restrict Googlebot-Image from certain directories.
Test your User-agent rules
Check how your robots.txt responds to different crawler user agents and verify your rules are working as intended.
Precedence rules: specific vs. wildcard
When a crawler finds both a specific block for its name and a wildcard block, it uses only the specific block. The wildcard is ignored entirely for that crawler.
User-agent: Googlebot
Disallow: /google-only-block/
User-agent: *
Disallow: /admin/
Disallow: /private/
In this example, Googlebot follows only its specific rules. It can access /admin/ and /private/ because those rules are in the wildcard block, not in the Googlebot block. This is a common source of bugs — people add a specific block for Googlebot and forget that it no longer inherits the wildcard rules.
If you want Googlebot to follow the same base rules plus additional ones, you must duplicate them:
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /google-only-block/
User-agent: *
Disallow: /admin/
Disallow: /private/
Specific blocks don't inherit wildcard rules
Once a crawler matches a named User-agent block, the wildcard block is completely ignored for that crawler. You must include all applicable rules in the specific block.
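This no-inheritance behavior is easy to verify with the standard-library parser, using the example above:

```python
import urllib.robotparser

# Named block plus wildcard: the named crawler uses ONLY its own block.
ROBOTS = """\
User-agent: Googlebot
Disallow: /google-only-block/

User-agent: *
Disallow: /admin/
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Googlebot ignores the wildcard rules entirely.
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))              # True
print(rp.can_fetch("Googlebot", "https://example.com/google-only-block/"))  # False
print(rp.can_fetch("OtherBot", "https://example.com/admin/"))               # False
```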
Multiple User-agent lines in one block
You can group multiple crawlers under the same set of rules by stacking User-agent lines before the first Disallow:
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Disallow: /private/
Allow: /private/public-report/
All three crawlers follow the same rules. This is cleaner than duplicating the entire block three times. The key requirement: the User-agent lines must be consecutive, with no blank lines or other directives between them.
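A sketch of grouped User-agent lines, checked with the standard-library parser (the Allow exception from the example is omitted here because `urllib.robotparser` applies Allow/Disallow rules in file order rather than by longest path match):

```python
import urllib.robotparser

# Three crawlers share one rule group.
ROBOTS = """\
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

for bot in ("Googlebot", "Bingbot", "DuckDuckBot"):
    print(bot, rp.can_fetch(bot, "https://example.com/private/"))  # all False
print(rp.can_fetch("Googlebot", "https://example.com/blog/"))      # True
```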
Blocking AI crawlers
A growing use case for targeted User-agent rules is blocking AI training crawlers while keeping your content visible in search results:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: cohere-ai
Disallow: /
# Allow search engines
User-agent: *
Disallow: /admin/
Allow: /
This lets Googlebot, Bingbot, and other search crawlers access your content normally while preventing AI companies from using your content for model training. The list of AI crawlers grows regularly, so review and update this periodically.
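One way to check a file like this before deploying is to run every AI bot name through the standard-library parser (the article URL is a placeholder):

```python
import urllib.robotparser

# The AI-blocking robots.txt from above.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

ai_bots = ["GPTBot", "ChatGPT-User", "CCBot", "Google-Extended",
           "anthropic-ai", "ClaudeBot", "cohere-ai"]
for bot in ai_bots:
    print(bot, rp.can_fetch(bot, "https://example.com/article"))  # all False
print(rp.can_fetch("Googlebot", "https://example.com/article"))   # True
```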
How to find what's crawling your site
Your robots.txt is only as good as your knowledge of what crawlers visit your site. Here are the primary ways to find out:
Server access logs
The definitive source. Check your web server logs for the User-agent header on incoming requests. Look for bot patterns and unfamiliar crawlers hitting your site at high frequency.
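A minimal sketch of that log analysis, assuming the common Apache/nginx combined format where the User-Agent is the final double-quoted field (the sample lines below stand in for a real access.log):

```python
import re
from collections import Counter

# Sample combined-format lines standing in for a real access.log.
LOG_LINES = [
    '1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '1.2.3.4 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 99 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 99 "-" "GPTBot/1.0"',
]

# The user-agent is the last quoted field on each line.
ua_pattern = re.compile(r'"([^"]*)"\s*$')
counts = Counter(m.group(1) for line in LOG_LINES if (m := ua_pattern.search(line)))
print(counts.most_common())
```

Sorting by count surfaces the heavy hitters; unfamiliar names near the top are the ones worth researching.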
Google Search Console
Under Settings > Crawl stats, Google shows you exactly how Googlebot is crawling your site — pages crawled per day, response codes, and crawl requests over time.
Bing Webmaster Tools
Similar to Search Console, Bing provides crawl statistics and information about how Bingbot interacts with your site.
Web analytics
Some bots execute JavaScript and show up in analytics. Most don't. Analytics data is incomplete for bot identification but can catch some crawlers.
When you find an unwanted crawler hitting your site heavily, add a specific User-agent block to your robots.txt to manage it. If the crawler is malicious and doesn't respect robots.txt, you'll need server-level blocking (firewall rules, .htaccess, or CDN-level bot management) instead.
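A minimal sketch of the server-level idea, with made-up bot names in the deny-list; real deployments usually do this in web server, firewall, or CDN configuration rather than application code:

```python
import re
from typing import Optional

# Hypothetical deny-list built from access-log findings; these
# names are illustrative, not real crawlers.
BLOCKED_UA = re.compile(r"badbot|evilscraper", re.IGNORECASE)

def should_block(user_agent: Optional[str]) -> bool:
    """Return True when a request's User-Agent matches the deny-list."""
    return bool(user_agent and BLOCKED_UA.search(user_agent))

print(should_block("BadBot/2.0 (+http://example.net)"))        # True
print(should_block("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
print(should_block(None))                                       # False
```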
A complete User-agent configuration example
Here's a real-world robots.txt that demonstrates multiple User-agent strategies:
# Search engines: full access with minor restrictions
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Applebot
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /cart/
# Image crawlers: block user upload directories
User-agent: Googlebot-Image
Disallow: /uploads/private/
# SEO tool crawlers: limit crawl impact
User-agent: AhrefsBot
User-agent: SemrushBot
Crawl-delay: 10
Disallow: /admin/
# AI training crawlers: block entirely
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
# Default rules for everything else
User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
Each group of crawlers gets exactly the access level appropriate for its purpose. Search engines get broad access. SEO tools are rate-limited. AI crawlers are blocked. Everything else gets a conservative default.
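The whole configuration can be exercised with the standard-library parser, including its Crawl-delay and Sitemap accessors (note one parser quirk: `urllib.robotparser` normalizes `Disallow: /search?` by dropping the trailing `?`, so don't rely on it to test query-string rules):

```python
import urllib.robotparser

# The complete configuration from above.
ROBOTS = """\
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Applebot
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /cart/

User-agent: Googlebot-Image
Disallow: /uploads/private/

User-agent: AhrefsBot
User-agent: SemrushBot
Crawl-delay: 10
Disallow: /admin/

User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/products/"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/api/v1"))     # False
print(rp.can_fetch("GPTBot", "https://example.com/"))              # False
print(rp.crawl_delay("AhrefsBot"))                                 # 10
print(rp.site_maps())                # ['https://example.com/sitemap.xml']
```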
Different crawlers need different rules. Your robots.txt should reflect that.