robots.txt User-Agent: How to Target Specific Crawlers
How to use the User-agent directive in robots.txt to create rules for specific search engines, bots, and crawlers.
What the User-agent directive does
The User-agent line in robots.txt specifies which crawler the following rules apply to. Every Disallow and Allow rule belongs to the User-agent block above it. Without a User-agent line, rules have no target.
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /old-content/
User-agent: *
Disallow: /admin/
This gives you granular control: different rules for different crawlers. You can let Google crawl pages that you block Bing from, or block AI training bots while allowing search engine crawlers full access.
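You can sanity-check per-crawler rules like these with Python's standard-library `urllib.robotparser` (an approximation of RFC 9309 behavior; the example.com URLs are placeholders):

```python
import urllib.robotparser

# The example robots.txt above, with one block per crawler.
ROBOTS = """\
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /old-content/

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Each crawler is bound only by the block that names it.
print(rp.can_fetch("Googlebot", "https://example.com/private/"))   # False
print(rp.can_fetch("Bingbot", "https://example.com/private/"))     # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/admin/"))  # False
```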
The wildcard User-agent
The asterisk * is the catch-all. It matches any crawler that doesn't have its own specific block:
User-agent: *
Disallow: /admin/
Disallow: /tmp/
If a crawler shows up that isn't mentioned by name anywhere in your robots.txt, it follows the User-agent: * rules. If there's no wildcard block and no specific block matching the crawler's name, the bot assumes everything is allowed.
Most sites only need the wildcard. You add specific User-agent blocks when you need different behavior for different crawlers.
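The "no match means everything is allowed" fallback can be demonstrated with the standard-library parser (bot names here are illustrative):

```python
import urllib.robotparser

# A robots.txt with one named block and no wildcard fallback.
ROBOTS = """\
User-agent: Googlebot
Disallow: /drafts/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Googlebot follows its block; a crawler matched by no block
# (and with no wildcard present) is allowed everywhere.
print(rp.can_fetch("Googlebot", "https://example.com/drafts/"))   # False
print(rp.can_fetch("MysteryBot", "https://example.com/drafts/"))  # True
```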
Targeting specific crawlers
Here's how to set rules for individual search engines:
# Google's primary web crawler
User-agent: Googlebot
Disallow: /internal-tools/
# Bing's crawler
User-agent: Bingbot
Disallow: /internal-tools/
Crawl-delay: 5
# Block AI training bots entirely
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Everyone else
User-agent: *
Disallow: /admin/
Allow: /
Crawlers find their group by comparing each User-agent value against their own product token. Google's specialized crawlers also fall back to the more general token: Googlebot-News obeys a Googlebot group when no Googlebot-News group exists.
Case sensitivity
Per RFC 9309, User-agent matching is case-insensitive. Googlebot, googlebot, and GOOGLEBOT all match the same crawler. However, using the standard capitalization from the crawler's documentation is recommended for readability.
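The matching rule can be sketched as a small helper (a simplification of RFC 9309 product-token matching; the function name and tokens are illustrative):

```python
def group_applies(group_token: str, crawler_token: str) -> bool:
    """Case-insensitive match of a robots.txt User-agent token
    against a crawler's product token (simplified from RFC 9309)."""
    if group_token == "*":
        return True
    return group_token.lower() == crawler_token.lower()

print(group_applies("GOOGLEBOT", "Googlebot"))  # True
print(group_applies("*", "AnyBot"))             # True
print(group_applies("Bingbot", "Googlebot"))    # False
```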
Common crawler names and what they do
Here's a reference of the crawlers you're most likely to encounter:
| User-agent | What it does |
|---|---|
| Googlebot | Google's primary web crawler for search indexing |
| Googlebot-Image | Google's image-specific crawler |
| Googlebot-News | Google News crawler |
| Google-Extended | Google's AI training data crawler |
| Bingbot | Microsoft Bing's search crawler |
| Yandex | Yandex search engine crawler (Russia) |
| Baiduspider | Baidu search engine crawler (China) |
| DuckDuckBot | DuckDuckGo's crawler |
| Slurp | Yahoo's legacy crawler |
| facebookexternalhit | Facebook's link preview crawler |
| Twitterbot | Twitter/X's link preview crawler |
| GPTBot | OpenAI's crawler for AI training |
| CCBot | Common Crawl's open web crawler |
| Applebot | Apple's crawler for Siri and Spotlight |
| AhrefsBot | Ahrefs SEO tool crawler |
| SemrushBot | Semrush SEO tool crawler |
Google operates several specialized crawlers beyond the main Googlebot. Each can be targeted independently. For example, you might allow Googlebot full access but restrict Googlebot-Image from certain directories.
Test your User-agent rules
Check how your robots.txt responds to different crawler user agents and verify your rules are working as intended.
Precedence rules: specific vs. wildcard
When a crawler finds both a specific block for its name and a wildcard block, it uses only the specific block. The wildcard is ignored entirely for that crawler.
User-agent: Googlebot
Disallow: /google-only-block/
User-agent: *
Disallow: /admin/
Disallow: /private/
In this example, Googlebot follows only its specific rules. It can access /admin/ and /private/ because those rules are in the wildcard block, not in the Googlebot block. This is a common source of bugs — people add a specific block for Googlebot and forget that it no longer inherits the wildcard rules.
If you want Googlebot to follow the same base rules plus additional ones, you must duplicate them:
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Disallow: /google-only-block/
User-agent: *
Disallow: /admin/
Disallow: /private/
Specific blocks don't inherit wildcard rules
Once a crawler matches a named User-agent block, the wildcard block is completely ignored for that crawler. You must include all applicable rules in the specific block.
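This no-inheritance behavior is easy to verify with the standard-library parser, using the example above:

```python
import urllib.robotparser

# Named block plus wildcard: the named crawler uses ONLY its own block.
ROBOTS = """\
User-agent: Googlebot
Disallow: /google-only-block/

User-agent: *
Disallow: /admin/
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Googlebot ignores the wildcard rules entirely.
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))              # True
print(rp.can_fetch("Googlebot", "https://example.com/google-only-block/"))  # False
print(rp.can_fetch("OtherBot", "https://example.com/admin/"))               # False
```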
Multiple User-agent lines in one block
You can group multiple crawlers under the same set of rules by stacking User-agent lines before the first Disallow:
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Disallow: /private/
Allow: /private/public-report/
All three crawlers follow the same rules. This is cleaner than duplicating the entire block three times. The key requirement: the User-agent lines must be consecutive, with no blank lines or other directives between them.
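A sketch of grouped User-agent lines, checked with the standard-library parser (the Allow exception from the example is omitted here because `urllib.robotparser` applies Allow/Disallow rules in file order rather than by longest path match):

```python
import urllib.robotparser

# Three crawlers share one rule group.
ROBOTS = """\
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

for bot in ("Googlebot", "Bingbot", "DuckDuckBot"):
    print(bot, rp.can_fetch(bot, "https://example.com/private/"))  # all False
print(rp.can_fetch("Googlebot", "https://example.com/blog/"))      # True
```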
Blocking AI crawlers
A growing use case for targeted User-agent rules is blocking AI training crawlers while keeping your content visible in search results:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: cohere-ai
Disallow: /
# Allow search engines
User-agent: *
Disallow: /admin/
Allow: /
This lets Googlebot, Bingbot, and other search crawlers access your content normally while preventing AI companies from using your content for model training. The list of AI crawlers grows regularly, so review and update this periodically.
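One way to check a file like this before deploying is to run every AI bot name through the standard-library parser (the article URL is a placeholder):

```python
import urllib.robotparser

# The AI-blocking robots.txt from above.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

ai_bots = ["GPTBot", "ChatGPT-User", "CCBot", "Google-Extended",
           "anthropic-ai", "ClaudeBot", "cohere-ai"]
for bot in ai_bots:
    print(bot, rp.can_fetch(bot, "https://example.com/article"))  # all False
print(rp.can_fetch("Googlebot", "https://example.com/article"))   # True
```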
How to find what's crawling your site
Your robots.txt is only as good as your knowledge of what crawlers visit your site. Here are the primary ways to find out:
Server access logs
The definitive source. Check your web server logs for the User-agent header on incoming requests. Look for bot patterns and unfamiliar crawlers hitting your site at high frequency.
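A minimal sketch of that log analysis, assuming the common Apache/nginx combined format where the User-Agent is the final double-quoted field (the sample lines below stand in for a real access.log):

```python
import re
from collections import Counter

# Sample combined-format lines standing in for a real access.log.
LOG_LINES = [
    '1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '1.2.3.4 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 99 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 99 "-" "GPTBot/1.0"',
]

# The user-agent is the last quoted field on each line.
ua_pattern = re.compile(r'"([^"]*)"\s*$')
counts = Counter(m.group(1) for line in LOG_LINES if (m := ua_pattern.search(line)))
print(counts.most_common())
```

Sorting by count surfaces the heavy hitters; unfamiliar names near the top are the ones worth researching.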
Google Search Console
Under Settings > Crawl stats, Google shows you exactly how Googlebot is crawling your site — pages crawled per day, response codes, and crawl requests over time.
Bing Webmaster Tools
Similar to Search Console, Bing provides crawl statistics and information about how Bingbot interacts with your site.
Web analytics
Some bots execute JavaScript and show up in analytics. Most don't. Analytics data is incomplete for bot identification but can catch some crawlers.
When you find an unwanted crawler hitting your site heavily, add a specific User-agent block to your robots.txt to manage it. If the crawler is malicious and doesn't respect robots.txt, you'll need server-level blocking (firewall rules, .htaccess, or CDN-level bot management) instead.
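A minimal sketch of the server-level idea, with made-up bot names in the deny-list; real deployments usually do this in web server, firewall, or CDN configuration rather than application code:

```python
import re
from typing import Optional

# Hypothetical deny-list built from access-log findings; these
# names are illustrative, not real crawlers.
BLOCKED_UA = re.compile(r"badbot|evilscraper", re.IGNORECASE)

def should_block(user_agent: Optional[str]) -> bool:
    """Return True when a request's User-Agent matches the deny-list."""
    return bool(user_agent and BLOCKED_UA.search(user_agent))

print(should_block("BadBot/2.0 (+http://example.net)"))        # True
print(should_block("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
print(should_block(None))                                       # False
```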
A complete User-agent configuration example
Here's a real-world robots.txt that demonstrates multiple User-agent strategies:
# Search engines: full access with minor restrictions
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Applebot
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /cart/
# Image crawlers: block user upload directories
User-agent: Googlebot-Image
Disallow: /uploads/private/
# SEO tool crawlers: limit crawl impact
User-agent: AhrefsBot
User-agent: SemrushBot
Crawl-delay: 10
Disallow: /admin/
# AI training crawlers: block entirely
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
# Default rules for everything else
User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
Each group of crawlers gets exactly the access level appropriate for its purpose. Search engines get broad access. SEO tools are rate-limited. AI crawlers are blocked. Everything else gets a conservative default.
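The whole configuration can be exercised with the standard-library parser, including its Crawl-delay and Sitemap accessors (note one parser quirk: `urllib.robotparser` normalizes `Disallow: /search?` by dropping the trailing `?`, so don't rely on it to test query-string rules):

```python
import urllib.robotparser

# The complete configuration from above.
ROBOTS = """\
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Applebot
Disallow: /admin/
Disallow: /api/
Disallow: /search?
Disallow: /cart/

User-agent: Googlebot-Image
Disallow: /uploads/private/

User-agent: AhrefsBot
User-agent: SemrushBot
Crawl-delay: 10
Disallow: /admin/

User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/products/"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/api/v1"))     # False
print(rp.can_fetch("GPTBot", "https://example.com/"))              # False
print(rp.crawl_delay("AhrefsBot"))                                 # 10
print(rp.site_maps())                # ['https://example.com/sitemap.xml']
```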
Different crawlers need different rules. Your robots.txt should reflect that.