Identify AI bots and crawlers

Reference table of common AI bot user-agent tokens by operator and category, plus how to verify that a request is genuine using reverse DNS and published IP ranges.

You identify AI traffic by the user-agent token each bot reports. This reference lists common tokens by operator, with the category each one belongs to. Use it to recognize AI traffic in your server logs and to write robots.txt rules.

The token is the part of the user-agent string that you match in robots.txt and logs. For example, the full OpenAI string Mozilla/5.0 ... compatible; GPTBot/1.1; +https://openai.com/gptbot reports the token GPTBot.
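As a rough illustration of token matching in logs, the sketch below searches a user-agent string for a known token. The `KNOWN_TOKENS` list is an illustrative subset of the reference table, and `find_token` is a hypothetical helper name, not part of any library. The boundary rules in the regular expression are one reasonable choice, not a standard.

```python
import re
from typing import Optional

# Illustrative subset of tokens from the reference table; extend as needed.
KNOWN_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "PerplexityBot", "CCBot",
]

# Match a token only when it is not embedded in a longer name
# (letters, digits, or hyphens on either side). Matching ignores case.
_TOKEN_RE = re.compile(
    r"(?<![A-Za-z0-9-])("
    + "|".join(re.escape(t) for t in KNOWN_TOKENS)
    + r")(?![A-Za-z0-9-])",
    re.IGNORECASE,
)

def find_token(user_agent: str) -> Optional[str]:
    """Return the first known token found in the string, or None."""
    match = _TOKEN_RE.search(user_agent)
    if match is None:
        return None
    found = match.group(1).lower()
    # Return the canonical spelling from KNOWN_TOKENS.
    for token in KNOWN_TOKENS:
        if token.lower() == found:
            return token
    return None
```

A match here only tells you what the request claims to be. The string is self-reported, so treat the result as a label to verify, not as proof of origin.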

| Operator | Token | Category | Notes |
|---|---|---|---|
| OpenAI | GPTBot | Training crawler | Collects content for model training. |
| OpenAI | OAI-SearchBot | Search indexer | Indexes content for ChatGPT search. |
| OpenAI | ChatGPT-User | On-demand fetcher | Fetches a page when a user’s prompt references it. |
| Anthropic | ClaudeBot | Training crawler | Collects content for Claude model training. |
| Anthropic | Claude-SearchBot | Search indexer | Indexes content for Claude search results. |
| Anthropic | Claude-User | On-demand fetcher | Fetches a page for a Claude user request. |
| Anthropic | anthropic-ai, Claude-Web | Training crawler | Deprecated tokens; keep in block rules for safety. |
| Google | Googlebot | Search indexer | Search crawler; content can also surface in AI Overviews. |
| Google | Google-Extended | Training control token | A robots.txt token to control Gemini and Vertex AI training. It is not a separate crawler. |
| Google | GoogleOther, Google-CloudVertexBot | Crawler / fetcher | Used for research and Vertex AI fetches. |
| Perplexity | PerplexityBot | Search indexer | Indexes content for Perplexity answers. |
| Perplexity | Perplexity-User | On-demand fetcher | Fetches a page for a user’s Perplexity query. |
| Apple | Applebot | Search indexer | Powers Siri and Spotlight suggestions. |
| Apple | Applebot-Extended | Training control token | A robots.txt token to opt out of Apple Intelligence training. It is not a separate crawler. |
| Meta | meta-externalagent | Training crawler | Collects content for Meta AI model training. |
| Meta | Meta-ExternalFetcher | On-demand fetcher | Fetches a page for a user-initiated request. |
| Amazon | Amazonbot | Crawler | Crawls for Amazon services, including AI features. |
| ByteDance | Bytespider | Training crawler | Collects training data; has a history of inconsistent robots.txt compliance. |
| Common Crawl | CCBot | Open dataset crawler | Builds a public dataset widely used for model training. |
| Cohere | cohere-ai | Crawler | Fetches content for Cohere models. |
| DuckDuckGo | DuckAssistBot | On-demand fetcher | Fetches a page for DuckDuckGo’s assistant. |
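To show how these tokens might feed robots.txt rules, the following sketch generates a rule set that disallows a chosen list of tokens and then checks it with Python's standard `urllib.robotparser`. The choice of which tokens to block (here, several training crawlers and training control tokens from the table) is only an example policy, and `build_robots_txt` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Example policy only: training crawlers and training control tokens
# taken from the reference table. Adjust to your own needs.
TRAINING_TOKENS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "meta-externalagent", "Bytespider", "CCBot",
]

def build_robots_txt(blocked_tokens):
    """Return robots.txt text that disallows each token and allows the rest."""
    lines = []
    for token in blocked_tokens:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"

robots_txt = build_robots_txt(TRAINING_TOKENS)

# Sanity-check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
```

Calling `parser.can_fetch("GPTBot", "https://example.com/article")` against these rules reports whether a compliant crawler using that token would be allowed. Keep in mind that robots.txt is advisory: it only affects bots that choose to honor it.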

A user-agent string is self-reported text, so anyone can copy a known token to disguise a scraper. Before you trust, rate-limit, or report a bot based on its name, confirm the request actually comes from the stated operator. Two methods are standard:

Some operators publish the IP addresses their bots use, so you can match the request’s source IP against the official list. OpenAI publishes a JSON file for each of its bots; for example, GPTBot addresses are listed at https://openai.com/gptbot.json. Other operators publish ranges on their crawler documentation pages. Because these lists change, check the request IP against the current published version before acting.
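The range check itself can be sketched with Python's standard `ipaddress` module. The JSON layout parsed below (a `prefixes` array whose entries carry `ipv4Prefix` or `ipv6Prefix`) is an assumption about the published format, so confirm it against the actual file before relying on it. The sample ranges are reserved documentation addresses, not real bot ranges, and both function names are hypothetical.

```python
import ipaddress
import json

def load_prefixes(doc_text):
    """Parse a published range file into network objects.

    ASSUMED schema: {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    """
    data = json.loads(doc_text)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks

def ip_in_published_ranges(ip, networks):
    """Return True if the IP falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)

# Placeholder data using reserved documentation ranges (not real bot IPs).
SAMPLE_DOC = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""
sample_networks = load_prefixes(SAMPLE_DOC)
```

In practice you would download the operator's file on a schedule, cache the parsed networks, and call `ip_in_published_ranges` for each request that claims the matching token.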

When an operator does not publish IP ranges, use forward-confirmed reverse DNS (FCrDNS), the same method search engines recommend for verifying their crawlers:

  1. Run a reverse DNS lookup on the request’s source IP to get a hostname.
  2. Confirm the hostname belongs to the operator’s domain.
  3. Run a forward DNS lookup on that hostname and confirm it resolves back to the original IP.

A request passes only if all three steps succeed. This defeats spoofing: an attacker can set the reverse DNS record for an IP address they control, but cannot make a hostname in the operator’s domain resolve back to that address.
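The three steps above can be sketched with Python's standard `socket` module. This is a minimal illustration rather than a production verifier: `verify_fcrdns` is a hypothetical name, the allowed domain suffixes must come from each operator's own documentation, and the resolver functions are injectable so the logic can be exercised without live DNS.

```python
import ipaddress
import socket

def _reverse_lookup(ip):
    """Step 1: reverse DNS lookup, returning the primary hostname."""
    return socket.gethostbyaddr(ip)[0]

def _forward_lookup(hostname):
    """Step 3: forward DNS lookup, returning the set of resolved addresses."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}

def verify_fcrdns(ip, allowed_suffixes,
                  reverse=_reverse_lookup, forward=_forward_lookup):
    """Return True only if all three FCrDNS steps succeed."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Step 2: the hostname must be the operator's domain or a subdomain of it.
    if not any(hostname == s or hostname.endswith("." + s)
               for s in allowed_suffixes):
        return False

    try:
        resolved = {ipaddress.ip_address(a) for a in forward(hostname)}
    except (OSError, ValueError):
        return False

    # The forward lookup must include the original source IP.
    return ipaddress.ip_address(ip) in resolved
```

The suffix check compares whole labels (`"." + suffix`), so a hostname such as `bot.example.com.attacker.test` does not pass for the suffix `example.com`. DNS lookups are slow, so a real deployment would likely cache results per IP.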

For current, authoritative values, consult each operator’s official documentation and a maintained bot directory.