Identify AI bots and crawlers

Reference table of common AI bot user-agent tokens by operator and category, plus how to verify that a request is genuine using reverse DNS and published IP ranges.

You identify AI traffic by the user-agent token each bot reports. This reference lists common tokens by operator, with the category each one belongs to. Use it to recognize AI traffic in your server logs and to write robots.txt rules.

Common AI user-agent tokens

The token is the part of the user-agent string you match in robots.txt and logs. For example, the full OpenAI string `Mozilla/5.0 ... compatible; GPTBot/1.1; +https://openai.com/gptbot` reports the token `GPTBot`.

Operator	Token	Category	Notes
OpenAI	`GPTBot`	Training crawler	Collects content for model training.
OpenAI	`OAI-SearchBot`	Search indexer	Indexes content for ChatGPT search.
OpenAI	`ChatGPT-User`	On-demand fetcher	Fetches a page when a user’s prompt references it.
Anthropic	`ClaudeBot`	Training crawler	Collects content for Claude model training.
Anthropic	`Claude-SearchBot`	Search indexer	Indexes content for Claude search results.
Anthropic	`Claude-User`	On-demand fetcher	Fetches a page for a Claude user request.
Anthropic	`anthropic-ai`, `Claude-Web`	Training crawler	Deprecated tokens; keep in block rules for safety.
Google	`Googlebot`	Search indexer	Search crawler; content can also surface in AI Overviews.
Google	`Google-Extended`	Training control token	A `robots.txt` token to control Gemini and Vertex AI training. It is not a separate crawler.
Google	`GoogleOther`, `Google-CloudVertexBot`	Crawler / fetcher	Used for research and Vertex AI fetches.
Perplexity	`PerplexityBot`	Search indexer	Indexes content for Perplexity answers.
Perplexity	`Perplexity-User`	On-demand fetcher	Fetches a page for a user’s Perplexity query.
Apple	`Applebot`	Search indexer	Powers Siri and Spotlight suggestions.
Apple	`Applebot-Extended`	Training control token	A `robots.txt` token to opt out of Apple Intelligence training. It is not a separate crawler.
Meta	`meta-externalagent`	Training crawler	Collects content for Meta AI model training.
Meta	`Meta-ExternalFetcher`	On-demand fetcher	Fetches a page for a user-initiated request.
Amazon	`Amazonbot`	Crawler	Crawls for Amazon services, including AI features.
ByteDance	`Bytespider`	Training crawler	Collects training data; has a history of inconsistent `robots.txt` compliance.
Common Crawl	`CCBot`	Open dataset crawler	Builds a public dataset widely used for model training.
Cohere	`cohere-ai`	Crawler	Fetches content for Cohere models.
DuckDuckGo	`DuckAssistBot`	On-demand fetcher	Fetches a page for DuckDuckGo’s assistant.
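
To recognize these tokens in server logs, one option is a case-insensitive substring match against the user-agent header. The following is a minimal sketch, not a complete detector: it covers only a subset of the table, the names `AI_TOKENS` and `classify_user_agent` are hypothetical, and the sample user-agent string is illustrative rather than the exact string any operator sends. Control tokens such as `Google-Extended` and `Applebot-Extended` are left out because they appear only in robots.txt, not in request headers.

```python
from typing import Optional, Tuple

# Subset of the table above: token -> (operator, category).
AI_TOKENS = {
    "GPTBot": ("OpenAI", "Training crawler"),
    "OAI-SearchBot": ("OpenAI", "Search indexer"),
    "ChatGPT-User": ("OpenAI", "On-demand fetcher"),
    "ClaudeBot": ("Anthropic", "Training crawler"),
    "Claude-SearchBot": ("Anthropic", "Search indexer"),
    "Claude-User": ("Anthropic", "On-demand fetcher"),
    "PerplexityBot": ("Perplexity", "Search indexer"),
    "Perplexity-User": ("Perplexity", "On-demand fetcher"),
    "meta-externalagent": ("Meta", "Training crawler"),
    "Bytespider": ("ByteDance", "Training crawler"),
    "CCBot": ("Common Crawl", "Open dataset crawler"),
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, operator, category) for the first known token found.

    Matching is a case-insensitive substring test. Longer tokens are tried
    first so that a longer token wins if it ever contains a shorter one.
    """
    lowered = user_agent.lower()
    for token in sorted(AI_TOKENS, key=len, reverse=True):
        if token.lower() in lowered:
            operator, category = AI_TOKENS[token]
            return token, operator, category
    return None


# Illustrative user-agent string (an assumption, not a captured header).
sample = "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
print(classify_user_agent(sample))
```

A match here only tells you what the request claims to be. Because the header is self-reported, treat the result as a label to verify, not as proof of origin.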

Verify a bot is genuine

A user-agent string is self-reported text, so anyone can copy a known token to disguise a scraper. Before you trust, rate-limit, or report a bot based on its name, confirm the request actually comes from the stated operator. Two methods are standard:

Published IP ranges

Some operators publish the IP addresses their bots use, so you can match a request’s source IP against the official list. OpenAI publishes a JSON file for each of its bots; for example, GPTBot addresses are listed at https://openai.com/gptbot.json. Other operators publish ranges on their crawler documentation pages. Always check against the current published list, because the ranges can change.
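
As a rough sketch of the matching step, the Python standard library's `ipaddress` module can test whether a source IP falls inside a list of CIDR ranges. The JSON layout below (a `prefixes` array with `ipv4Prefix` and `ipv6Prefix` keys) is an assumption about how such a file might be structured, so check the operator's actual file before relying on it. The sample ranges are reserved documentation addresses, not real bot ranges, and the function names are hypothetical.

```python
import ipaddress
import json

# Assumed file layout with placeholder documentation ranges (not real bot IPs).
SAMPLE_RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(ranges_json: str) -> list:
    """Parse the assumed JSON layout into a list of ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            # strict=False tolerates entries whose host bits are set.
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_published_ranges(ip: str, networks: list) -> bool:
    """Return True if ip falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(
        address.version == network.version and address in network
        for network in networks
    )


networks = load_networks(SAMPLE_RANGES_JSON)
print(ip_in_published_ranges("192.0.2.55", networks))
print(ip_in_published_ranges("198.51.100.1", networks))
```

In practice you would download the operator's file on a schedule and cache the parsed networks, rather than fetching it on every request.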

Forward-confirmed reverse DNS

When an operator does not publish IP ranges, use forward-confirmed reverse DNS (FCrDNS), the same method search engines recommend for verifying their crawlers:

1. Run a reverse DNS lookup on the request’s source IP to get a hostname.
2. Confirm the hostname belongs to the operator’s domain.
3. Run a forward DNS lookup on that hostname and confirm it resolves back to the original IP.

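A request passes only if all three steps agree. This defeats spoofing: an attacker can point the reverse DNS (PTR) record for an IP address they control at any hostname, but they cannot make a hostname in the operator’s domain resolve back to that IP, because only the operator controls its forward DNS records.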
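
The three checks can be sketched with Python's standard `socket` module. This is a minimal illustration under stated assumptions: the function names are hypothetical, the domain `bots.example.com` is a placeholder for whatever hostname suffix the operator documents, and the `reverse` and `forward` parameters exist only so the lookups can be swapped out for testing or caching.

```python
import ipaddress
import socket
from typing import Callable, Iterable, Set


def reverse_lookup(ip: str) -> str:
    """Step 1: return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> Set[str]:
    """Step 3: return every address the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Drop any IPv6 scope suffix such as "%eth0" before comparing.
    return {info[4][0].split("%")[0] for info in infos}


def is_forward_confirmed(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Set[str]] = forward_lookup,
) -> bool:
    """Return True only if all three FCrDNS checks agree."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    # Step 2: the hostname must equal an allowed domain or be a subdomain of it.
    # Requiring the leading dot rejects lookalikes such as "evilbots.example.com".
    suffixes = [s.lower() for s in allowed_suffixes]
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False

    try:
        addresses = forward(hostname)
    except OSError:
        return False

    # Compare parsed addresses so differently formatted IPv6 strings still match.
    target = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(a) == target for a in addresses)
```

For example, `is_forward_confirmed(request_ip, ["bots.example.com"])` would perform live DNS lookups for `request_ip`. DNS lookups add latency, so a real deployment would likely cache results per IP instead of resolving on every request.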

Sources

For current, authoritative values, consult each operator’s official documentation and a maintained bot directory.

Next steps

Monitor AI traffic to your docs — turn these tokens into measurable reports.
Control AI crawler access — use these tokens to allow or block crawlers in robots.txt.