robots.txt Glossary: Every Directive and Term Explained
Every term, directive, and concept related to robots.txt -- in plain English. Use this as a quick reference when you encounter unfamiliar terminology.
A
Allow
A robots.txt directive that permits crawling of a specific URL path, overriding a broader Disallow rule. Defined in RFC 9309. The most specific (longest) matching rule wins when Allow and Disallow conflict.
User-agent: *
Disallow: /docs/
Allow: /docs/public/
In this example, /docs/public/guide.html is crawlable even though /docs/ is blocked.
B
Bingbot
Microsoft Bing's primary web crawler. Identifies itself as bingbot in the User-agent string. Respects robots.txt rules, including Crawl-delay. You can manage its behavior through Bing Webmaster Tools or via robots.txt directives targeting Bingbot.
Bytespider
ByteDance's web crawler, associated with TikTok's parent company. Used for content indexing and, reportedly, AI training data collection. Many site owners block Bytespider in robots.txt alongside other AI crawlers. Identifies as Bytespider in the User-agent string.
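A typical blocking rule looks like this:
User-agent: Bytespider
Disallow: /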
C
CCBot
The crawler operated by Common Crawl, a nonprofit that builds and maintains an open repository of web crawl data. This dataset is widely used for AI model training, including by many large language model projects. Block it with User-agent: CCBot followed by Disallow: /.
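That blocking rule in full:
User-agent: CCBot
Disallow: /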
ClaudeBot
Anthropic's web crawler used to gather training data for Claude AI models. Identifies as ClaudeBot in the User-agent string. Respects robots.txt rules. If you want to block Anthropic's crawler from your site, target ClaudeBot in your robots.txt.
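To block it entirely:
User-agent: ClaudeBot
Disallow: /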
Crawl-delay
A robots.txt directive that requests a crawler wait a specified number of seconds between successive requests. Not part of RFC 9309. Google ignores it entirely -- use Google Search Console to control Googlebot's crawl rate. Bing and Yandex support it.
User-agent: Bingbot
Crawl-delay: 10
Crawler
A program that systematically browses the web to discover and index content. Also called a spider, bot, or robot. Search engine crawlers like Googlebot follow links from page to page, reading content and adding it to the search engine's index. Crawlers check robots.txt before accessing a site.
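Python's standard library ships a parser that applies robots.txt rules the way a polite crawler would before fetching a page. A minimal sketch (the bot name and URLs are just examples):

```python
from urllib.robotparser import RobotFileParser

# A tiny robots.txt, parsed the way a polite crawler would before fetching.
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() answers: may this user agent request this URL?
print(parser.can_fetch("ExampleBot", "https://example.com/private/report.html"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/blog/post.html"))       # True
```

Real crawlers fetch /robots.txt over HTTP first, but the matching logic is the same.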
D
Directive
A single instruction in a robots.txt file. The standard directives are User-agent, Disallow, Allow, and Sitemap. Extensions like Crawl-delay and Host are also directives, though not part of the official specification.
Disallow
The core robots.txt directive that tells a crawler not to access URLs matching the specified path. A Disallow: with an empty value means nothing is blocked. Disallow: / blocks the entire site for the specified user agent.
User-agent: *
Disallow: /private/
Disallow: /admin/
G
Googlebot
Google's primary web crawler. The most important crawler for SEO. Identifies as Googlebot in the User-agent string. Google also operates specialized crawlers like Googlebot-Image and Googlebot-News. If your robots.txt contains a rule group targeting Googlebot by name, Googlebot follows that group and ignores the wildcard (*) group.
Googlebot-Image
Google's image-specific crawler. Separate from the main Googlebot. If you want to allow your pages to be crawled but prevent your images from appearing in Google Images, you can specifically block Googlebot-Image. Note that blocking Googlebot also blocks Googlebot-Image, but blocking Googlebot-Image does not block Googlebot.
User-agent: Googlebot-Image
Disallow: /images/private/
Googlebot-News
Google's crawler for Google News content. Target it with User-agent: Googlebot-News when you want to control which content appears in Google News specifically. Blocking Googlebot-News does not affect your regular Google Search presence.
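For example, to keep a hypothetical /opinion/ section out of Google News while leaving regular Search untouched:
User-agent: Googlebot-News
Disallow: /opinion/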
GPTBot
OpenAI's web crawler used to gather training data for GPT models. Identifies as GPTBot in the User-agent string. OpenAI states that GPTBot respects robots.txt. One of the most commonly blocked AI crawlers. A separate agent, ChatGPT-User, handles real-time web browsing in ChatGPT.
User-agent: GPTBot
Disallow: /
H
Host
A non-standard robots.txt directive used primarily by Yandex to specify the preferred domain (e.g., www.example.com vs example.com). Google and Bing ignore it. For domain preference, use canonical tags and 301 redirects instead.
N
Noindex (in robots.txt)
A directive that some crawlers historically supported in robots.txt to prevent indexing of specific URLs. Google previously supported Noindex: as an unofficial robots.txt directive but officially dropped support for it on September 1, 2019. Do not use Noindex in robots.txt. Use the <meta name="robots" content="noindex"> tag or the X-Robots-Tag: noindex HTTP header instead.
Noindex does not work in robots.txt
Google does not support a Noindex directive in robots.txt. If you need to prevent a page from being indexed, use the noindex meta tag in your HTML or the X-Robots-Tag HTTP header. Placing Noindex in robots.txt will have no effect.
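The supported alternatives, for reference, are a meta tag in the page's HTML:
<meta name="robots" content="noindex">
or the equivalent HTTP response header:
X-Robots-Tag: noindex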
P
PerplexityBot
The web crawler operated by Perplexity AI, used to index content for their AI-powered search engine. Identifies as PerplexityBot in the User-agent string. Like other AI crawlers, it can be blocked by targeting its user agent in robots.txt.
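To block it:
User-agent: PerplexityBot
Disallow: /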
R
Robots Exclusion Protocol
The standard that defines how robots.txt files work. Originally proposed by Martijn Koster in 1994 as an informal convention. Formalized as RFC 9309 in September 2022, in an effort led by Google engineers with contributions from other search engine companies. The protocol specifies how crawlers should request, parse, and apply robots.txt rules.
Robots.txt
A plain text file placed at the root of a website (/robots.txt) that communicates crawling permissions to web robots. The file uses the Robots Exclusion Protocol to define which user agents can access which parts of the site. It is advisory -- well-behaved crawlers respect it, but it is not an access control mechanism.
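A minimal file combining the core directives (paths and sitemap URL are illustrative):
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml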
S
Sitemap
A robots.txt directive that specifies the location of an XML sitemap. Not tied to any User-agent block. The URL must be absolute (including the protocol). You can include multiple Sitemap directives in a single robots.txt file. This helps crawlers discover pages they might not find through link crawling alone.
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-blog.xml
Spider
Another term for a web crawler or bot. The name comes from the metaphor of a spider traversing the "web" of links between pages. Googlebot, Bingbot, and other crawlers are all spiders. The terms spider, crawler, bot, and robot are used interchangeably.
U
User-agent
A robots.txt directive that specifies which crawler the following rules apply to. Each User-agent line starts a new rule group. Use * as a wildcard to target all crawlers. The value is case-insensitive -- Googlebot and googlebot match the same crawler.
User-agent: *
Disallow: /private/
User-agent: Googlebot
Allow: /
If a crawler finds a rule group matching its specific name, it uses that group exclusively and ignores the wildcard group.
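This group exclusivity can be observed with Python's built-in parser (the bot names here are just examples):

```python
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own group, so the wildcard Disallow is ignored.
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))
# A crawler with no dedicated group falls back to the wildcard rules.
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/page"))
```

The first call returns True and the second False: each crawler uses exactly one rule group.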
W
Wildcard
In robots.txt, the * character matches any sequence of characters in a Disallow or Allow path. The $ character anchors a match to the end of the URL. Both special characters are defined in RFC 9309 and supported by Google, Bing, and most major crawlers, though some smaller bots still ignore them.
# Block all PDF files
Disallow: /*.pdf$
# Block all URLs with query strings
Disallow: /*?*
The * in User-agent: * is a different use of the wildcard -- it means "all crawlers" rather than being a pattern match.