# Identify AI bots and crawlers

import { Aside } from '@astrojs/starlight/components';

You identify AI traffic by the user-agent token each bot reports. This reference lists common tokens by operator, with the category each one belongs to. Use it to recognize AI traffic in your server logs and to write `robots.txt` rules.

<Aside type="caution">
Operators add, rename, and retire bots regularly, and user-agent strings include version numbers that change. Treat this table as a starting point and confirm current values against each operator's official bot documentation. The "Verify a bot is genuine" section explains why the token alone is not proof.
</Aside>

## Common AI user-agent tokens

The token is the part of the user-agent string you match in `robots.txt` and logs — for example, the full OpenAI string `Mozilla/5.0 ... compatible; GPTBot/1.1; +https://openai.com/gptbot` reports the token `GPTBot`.

| Operator | Token | Category | Notes |
|----------|-------|----------|-------|
| OpenAI | `GPTBot` | Training crawler | Collects content for model training. |
| OpenAI | `OAI-SearchBot` | Search indexer | Indexes content for ChatGPT search. |
| OpenAI | `ChatGPT-User` | On-demand fetcher | Fetches a page when a user's prompt references it. |
| Anthropic | `ClaudeBot` | Training crawler | Collects content for Claude model training. |
| Anthropic | `Claude-SearchBot` | Search indexer | Indexes content for Claude search results. |
| Anthropic | `Claude-User` | On-demand fetcher | Fetches a page for a Claude user request. |
| Anthropic | `anthropic-ai`, `Claude-Web` | Training crawler | Deprecated tokens; keep in block rules for safety. |
| Google | `Googlebot` | Search indexer | Search crawler; content can also surface in AI Overviews. |
| Google | `Google-Extended` | Training control token | A `robots.txt` token to control Gemini and Vertex AI training. It is not a separate crawler. |
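| Google | `GoogleOther`, `Google-CloudVertexBot` | Crawler / fetcher | `GoogleOther` is a generic crawler for Google research and development; `Google-CloudVertexBot` crawls at a site owner's request for Vertex AI. |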
| Perplexity | `PerplexityBot` | Search indexer | Indexes content for Perplexity answers. |
| Perplexity | `Perplexity-User` | On-demand fetcher | Fetches a page for a user's Perplexity query. |
| Apple | `Applebot` | Search indexer | Powers Siri and Spotlight suggestions. |
| Apple | `Applebot-Extended` | Training control token | A `robots.txt` token to opt out of Apple Intelligence training. It is not a separate crawler. |
| Meta | `meta-externalagent` | Training crawler | Collects content for Meta AI model training. |
| Meta | `Meta-ExternalFetcher` | On-demand fetcher | Fetches a page for a user-initiated request. |
| Amazon | `Amazonbot` | Crawler | Crawls for Amazon services, including AI features. |
| ByteDance | `Bytespider` | Training crawler | Collects training data; has a history of inconsistent `robots.txt` compliance. |
| Common Crawl | `CCBot` | Open dataset crawler | Builds a public dataset widely used for model training. |
| Cohere | `cohere-ai` | Crawler | Fetches content for Cohere models. |
| DuckDuckGo | `DuckAssistBot` | On-demand fetcher | Fetches a page for DuckDuckGo's assistant. |
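
To recognize these tokens in server logs, a case-insensitive substring match on the user-agent field is usually enough. The following is a minimal sketch, not a complete classifier: the `AI_TOKENS` mapping holds only a subset of the table, and the `identify` function name is hypothetical.

```python
# Subset of the table above: token -> (operator, category).
AI_TOKENS = {
    "GPTBot": ("OpenAI", "Training crawler"),
    "OAI-SearchBot": ("OpenAI", "Search indexer"),
    "ChatGPT-User": ("OpenAI", "On-demand fetcher"),
    "ClaudeBot": ("Anthropic", "Training crawler"),
    "Claude-SearchBot": ("Anthropic", "Search indexer"),
    "Claude-User": ("Anthropic", "On-demand fetcher"),
    "PerplexityBot": ("Perplexity", "Search indexer"),
    "Perplexity-User": ("Perplexity", "On-demand fetcher"),
    "meta-externalagent": ("Meta", "Training crawler"),
    "Bytespider": ("ByteDance", "Training crawler"),
    "CCBot": ("Common Crawl", "Open dataset crawler"),
}


def identify(user_agent):
    """Return (token, operator, category) for a known AI token, else None."""
    ua = user_agent.lower()
    # Check longer tokens first so a more specific token wins over a shorter one.
    for token in sorted(AI_TOKENS, key=len, reverse=True):
        if token.lower() in ua:
            operator, category = AI_TOKENS[token]
            return token, operator, category
    return None


sample = "Mozilla/5.0 ... compatible; GPTBot/1.1; +https://openai.com/gptbot"
print(identify(sample))
```

A match here only tells you what the request claims to be. It does not show that the request came from the named operator.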

<Aside type="note">
`Google-Extended` and `Applebot-Extended` are control tokens, not crawlers. No bot reports them as a user agent. You place them in `robots.txt` to opt out of AI training while the operator's search crawler (`Googlebot` or `Applebot`) continues to fetch your pages for search.
</Aside>
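
One way to sanity-check rules that use control tokens is Python's standard `urllib.robotparser` module. The sketch below assumes a hypothetical site at `example.com` and an illustrative `robots.txt` that opts out of AI training while leaving search crawling open. The standard-library parser only approximates how each operator interprets `robots.txt`, so treat the result as a quick check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block the training control tokens, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/docs/"  # hypothetical page
for agent in ("Google-Extended", "Googlebot", "Applebot-Extended", "Applebot"):
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent}: {verdict}")
```

With these rules, the control tokens are blocked while `Googlebot` and `Applebot` fall through to the wildcard group and stay allowed.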

## Verify a bot is genuine

A user-agent string is self-reported text, so anyone can copy a known token to disguise a scraper. Before you trust, rate-limit, or report a bot based on its name, confirm the request actually comes from the stated operator. Two methods are standard:

### Published IP ranges

Some operators publish the IP addresses their bots use, so you can match a request's source IP against the official list. OpenAI publishes a JSON file for each of its bots; for example, `GPTBot` addresses are at `https://openai.com/gptbot.json`. Other operators publish ranges on their crawler documentation pages. These lists change, so check against the current published version rather than a copy you saved earlier.
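
Matching an IP against published ranges can be sketched with Python's standard `ipaddress` module. The JSON shape used here (a `prefixes` list with `ipv4Prefix` or `ipv6Prefix` keys) is an assumption for illustration, so check the actual file format before relying on it. The sample data uses reserved documentation ranges, not real bot addresses, and both function names are hypothetical.

```python
import ipaddress


def prefixes_from_document(doc):
    """Extract CIDR strings from a range document.

    Assumed shape: {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    """
    cidrs = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            cidrs.append(cidr)
    return cidrs


def ip_in_ranges(ip, cidrs):
    """Return True if ip falls inside any of the CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan.
    return any(addr in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)


# Sample data using documentation-only ranges (RFC 5737 and RFC 3849).
SAMPLE_DOC = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}

ranges = prefixes_from_document(SAMPLE_DOC)
print(ip_in_ranges("192.0.2.44", ranges))
print(ip_in_ranges("198.51.100.7", ranges))
```

In practice you would load the operator's current file instead of `SAMPLE_DOC` and refresh it on a schedule.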

### Forward-confirmed reverse DNS

When an operator does not publish IP ranges, use forward-confirmed reverse DNS (FCrDNS), the same method search engines recommend for verifying their crawlers:

1. Run a reverse DNS lookup on the request's source IP to get a hostname.
2. Confirm the hostname belongs to the operator's domain.
3. Run a forward DNS lookup on that hostname and confirm it resolves back to the original IP.

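A request passes only if all three steps agree. The reverse lookup alone is not enough, because whoever controls an IP address can point its reverse DNS record at any hostname. The forward lookup closes that gap: an attacker cannot add records under the operator's domain, so a spoofed hostname will not resolve back to the attacker's IP.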
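
The three steps can be sketched in Python with the standard `socket` module. This is a minimal illustration under stated assumptions: the function names are hypothetical, the lookups are injectable so the logic can be exercised without network access, and the allowed domain suffixes must come from each operator's documentation.

```python
import ipaddress
import socket


def reverse_lookup(ip):
    """Step 1: reverse DNS (PTR) lookup, returning the primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Step 3: forward DNS lookup, returning every address for the hostname."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def passes_fcrdns(ip, allowed_suffixes, reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if all three FCrDNS steps agree."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Step 2: the hostname must be the operator's domain or a subdomain of it.
    # Matching on ".suffix" rejects lookalikes such as "evil-example.com".
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        resolved = {ipaddress.ip_address(a) for a in forward(hostname)}
    except (OSError, ValueError):
        return False
    return ipaddress.ip_address(ip) in resolved
```

A call such as `passes_fcrdns(request_ip, ["example.com"])` performs live DNS queries, so cache results per IP rather than resolving on every request.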

<Aside type="tip">
Always verify before you block or rate-limit by IP. Acting on the user-agent token alone risks blocking a genuine bot you meant to allow, or trusting a scraper that borrowed the name.
</Aside>

## Sources

For current, authoritative values, consult each operator's official documentation and a maintained bot directory:

- [OpenAI: Overview of OpenAI crawlers](https://platform.openai.com/docs/bots)
- [Anthropic: Does Anthropic crawl data from the web?](https://support.claude.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler)
- [Perplexity: Perplexity crawlers](https://docs.perplexity.ai/guides/bots)
- [Known Agents directory](https://knownagents.com/)

## Next steps

- [Monitor AI traffic to your docs](/guides/ai-traffic/monitor/) — turn these tokens into measurable reports.
- [Control AI crawler access](/guides/control-ai-access/) — use these tokens to allow or block crawlers in `robots.txt`.