Identify AI bots and crawlers
Reference table of common AI bot user-agent tokens by operator and category, plus how to verify that a request is genuine using reverse DNS and published IP ranges.
You identify AI traffic by the user-agent token each bot reports. This reference lists common tokens by operator, with the category each one belongs to. Use it to recognize AI traffic in your server logs and to write robots.txt rules.
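One way to sanity-check a robots.txt rule before deploying it is to parse it with Python's standard-library `urllib.robotparser`. The following is a minimal sketch. The rules, the `is_allowed` helper, and the example.com URL are illustrative assumptions, not recommendations, and real crawlers may apply their own matching logic.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules (an assumption, not a recommendation):
# block the GPTBot token everywhere and allow every other agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, token: str, url: str) -> bool:
    """Return True if the given user-agent token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    for token in ("GPTBot", "OAI-SearchBot"):
        print(token, is_allowed(ROBOTS_TXT, token, "https://example.com/docs/"))
```

This only checks what the rules say. It does not tell you whether a given bot honors them.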
Common AI user-agent tokens
The token is the part of the user-agent string you match in robots.txt and logs. For example, the full OpenAI string Mozilla/5.0 ... compatible; GPTBot/1.1; +https://openai.com/gptbot reports the token GPTBot.
| Operator | Token | Category | Notes |
|---|---|---|---|
| OpenAI | GPTBot | Training crawler | Collects content for model training. |
| OpenAI | OAI-SearchBot | Search indexer | Indexes content for ChatGPT search. |
| OpenAI | ChatGPT-User | On-demand fetcher | Fetches a page when a user’s prompt references it. |
| Anthropic | ClaudeBot | Training crawler | Collects content for Claude model training. |
| Anthropic | Claude-SearchBot | Search indexer | Indexes content for Claude search results. |
| Anthropic | Claude-User | On-demand fetcher | Fetches a page for a Claude user request. |
| Anthropic | anthropic-ai, Claude-Web | Training crawler | Deprecated tokens; keep in block rules for safety. |
| Google | Googlebot | Search indexer | Search crawler; content can also surface in AI Overviews. |
| Google | Google-Extended | Training control token | A robots.txt token to control Gemini and Vertex AI training. It is not a separate crawler. |
| Google | GoogleOther, Google-CloudVertexBot | Crawler / fetcher | Used for research and Vertex AI fetches. |
| Perplexity | PerplexityBot | Search indexer | Indexes content for Perplexity answers. |
| Perplexity | Perplexity-User | On-demand fetcher | Fetches a page for a user’s Perplexity query. |
| Apple | Applebot | Search indexer | Powers Siri and Spotlight suggestions. |
| Apple | Applebot-Extended | Training control token | A robots.txt token to opt out of Apple Intelligence training. It is not a separate crawler. |
| Meta | meta-externalagent | Training crawler | Collects content for Meta AI model training. |
| Meta | Meta-ExternalFetcher | On-demand fetcher | Fetches a page for a user-initiated request. |
| Amazon | Amazonbot | Crawler | Crawls for Amazon services, including AI features. |
| ByteDance | Bytespider | Training crawler | Collects training data; has a history of inconsistent robots.txt compliance. |
| Common Crawl | CCBot | Open dataset crawler | Builds a public dataset widely used for model training. |
| Cohere | cohere-ai | Crawler | Fetches content for Cohere models. |
| DuckDuckGo | DuckAssistBot | On-demand fetcher | Fetches a page for DuckDuckGo’s assistant. |
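To recognize these tokens in server logs, a case-insensitive substring match on the user-agent field is usually enough. The sketch below covers only a subset of the table. The `AI_TOKENS` mapping and the `classify_user_agent` helper are hypothetical names introduced for illustration, and the sample string is the abbreviated OpenAI example quoted above.

```python
# Subset of the table above: token -> (operator, category).
# Extend it with the other rows as needed.
AI_TOKENS = {
    "GPTBot": ("OpenAI", "Training crawler"),
    "OAI-SearchBot": ("OpenAI", "Search indexer"),
    "ChatGPT-User": ("OpenAI", "On-demand fetcher"),
    "ClaudeBot": ("Anthropic", "Training crawler"),
    "Claude-SearchBot": ("Anthropic", "Search indexer"),
    "Claude-User": ("Anthropic", "On-demand fetcher"),
    "PerplexityBot": ("Perplexity", "Search indexer"),
    "Perplexity-User": ("Perplexity", "On-demand fetcher"),
    "Bytespider": ("ByteDance", "Training crawler"),
    "CCBot": ("Common Crawl", "Open dataset crawler"),
}


def classify_user_agent(user_agent: str):
    """Return (token, operator, category) for a known AI token, else None."""
    ua = user_agent.lower()
    # Check longer tokens first so a more specific token wins over a shorter one.
    for token in sorted(AI_TOKENS, key=len, reverse=True):
        if token.lower() in ua:
            operator, category = AI_TOKENS[token]
            return token, operator, category
    return None


if __name__ == "__main__":
    sample = "Mozilla/5.0 ... compatible; GPTBot/1.1; +https://openai.com/gptbot"
    print(classify_user_agent(sample))
```

A match here only tells you what the request claims to be. The claim still needs to be verified.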
Verify a bot is genuine
A user-agent string is self-reported text, so anyone can copy a known token to disguise a scraper. Before you trust, rate-limit, or report a bot based on its name, confirm the request actually comes from the stated operator. Two methods are standard:
Published IP ranges
Some operators publish the IP addresses their bots use, so you can match the request’s source IP against the official list. OpenAI publishes a JSON file for each of its bots; for example, GPTBot addresses are at https://openai.com/gptbot.json. Other operators publish ranges on their crawler documentation pages. These lists change over time, so check against the current published version before acting.
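Matching a source IP against published ranges can be sketched with Python's standard-library `ipaddress` module. This is a minimal illustration. The `ip_in_ranges` helper is a hypothetical name, and the CIDR blocks are reserved documentation ranges standing in for an operator's real list, which you would load from its published source.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real bot ranges.
# In practice, load the current CIDR list the operator publishes.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_ranges(ip: str, cidr_ranges) -> bool:
    """Return True if the IP falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to check in a single pass.
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in cidr_ranges
    )


if __name__ == "__main__":
    print(ip_in_ranges("192.0.2.44", PLACEHOLDER_RANGES))
    print(ip_in_ranges("203.0.113.9", PLACEHOLDER_RANGES))
```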
Forward-confirmed reverse DNS
When an operator does not publish IP ranges, use forward-confirmed reverse DNS (FCrDNS), the same method search engines recommend for verifying their crawlers:
- Run a reverse DNS lookup on the request’s source IP to get a hostname.
- Confirm the hostname belongs to the operator’s domain.
- Run a forward DNS lookup on that hostname and confirm it resolves back to the original IP.
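A request passes only if all three steps succeed. The forward lookup is what defeats spoofing: whoever controls an IP address can point its reverse DNS record at any hostname, but cannot make the operator’s own DNS records resolve that hostname back to their IP.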
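The three steps can be sketched with Python's standard `socket` module. This is an illustration under assumptions, not a hardened implementation. The `verify_fcrdns` helper and its injectable `reverse` and `forward` parameters are hypothetical names, and the allowed domain suffixes must come from the operator's own documentation.

```python
import ipaddress
import socket


def reverse_lookup(ip: str) -> str:
    """Step 1: reverse DNS lookup, returning the primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> set:
    """Step 3: forward lookup, returning every address the name resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def verify_fcrdns(ip, allowed_suffixes, reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if all three FCrDNS steps agree."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or lookup failure
        return False

    # Step 2: the hostname must equal an allowed domain or be a subdomain of it.
    # Matching on ".suffix" rejects lookalikes such as "bots.example.com.evil.test".
    suffixes = [s.lower().lstrip(".") for s in allowed_suffixes]
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False

    try:
        resolved = {ipaddress.ip_address(a) for a in forward(hostname)}
    except (OSError, ValueError):
        return False
    # Compare parsed addresses so differing textual forms still match.
    return ipaddress.ip_address(ip) in resolved
```

The resolvers are passed in as parameters so the logic can be tested without network access. In production you would likely also cache results, because DNS lookups on every request are slow.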
Sources
For current, authoritative values, consult each operator’s official documentation and a maintained bot directory:
- OpenAI: Overview of OpenAI crawlers
- Anthropic: Does Anthropic crawl data from the web?
- Perplexity: Perplexity crawlers
- Known Agents directory
Next steps
- Monitor AI traffic to your docs: turn these tokens into measurable reports.
- Control AI crawler access: use these tokens to allow or block crawlers in robots.txt.