Control AI crawler access

Use robots.txt to decide which AI crawlers may access your documentation — block training crawlers while allowing search and on-demand fetchers — with a copy-paste template.

You decide which AI systems may use your documentation. The standard lever is robots.txt, where you allow or disallow individual crawlers by their user-agent token. This guide gives you a recommended strategy and a copy-paste template.

Publish a robots.txt that opts your content out of AI model training while keeping it available to the AI search and assistant traffic that sends readers your way.

Before you start, you need:

  • The ability to serve a file at your site root (https://docs.example.com/robots.txt).
  • The user-agent tokens you want to target. See Identify AI bots and crawlers.

The strategy: block training, allow search and on-demand

AI crawlers fall into categories that the bot reference describes in full. For documentation, a common position is:

  • Block training crawlers — they consume bandwidth to build model datasets and return little direct value (for example, GPTBot, ClaudeBot, CCBot, Bytespider, meta-externalagent).
  • Block training with opt-out tokens: Google-Extended and Applebot-Extended opt you out of Gemini and Apple Intelligence training without affecting search.
  • Allow search indexers and on-demand fetchers — these surface your docs in AI answers and fetch pages when a user asks a question, which sends qualified readers to you (for example, OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User, Claude-User).

Save the following as robots.txt at your site root, then adjust the lists to match your policy. Crawlers not listed in either group fall through to your existing default rules, such as a User-agent: * group.

# Block AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: meta-externalagent
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: cohere-ai
Disallow: /

# Allow AI search indexers and on-demand fetchers
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Claude-User
Allow: /
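
One way to sanity-check the template before publishing is to parse it locally and ask how each token would be treated. The sketch below uses Python's standard-library urllib.robotparser. The page URL and the verdicts helper are illustrative assumptions, and real crawlers may resolve edge cases differently from this parser, so treat the result as a first check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The template from this guide, inlined so the check runs without network access.
ROBOTS_TXT = """\
# Block AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: meta-externalagent
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: cohere-ai
Disallow: /

# Allow AI search indexers and on-demand fetchers
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Claude-User
Allow: /
"""

BLOCKED = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Bytespider",
    "meta-externalagent", "Google-Extended", "Applebot-Extended", "cohere-ai",
]
ALLOWED = [
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Claude-User",
]


def verdicts(robots_txt, tokens, url="https://docs.example.com/guides/intro"):
    """Return {token: bool}: whether each token may fetch `url` (a hypothetical page)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    for token, allowed in verdicts(ROBOTS_TXT, BLOCKED + ALLOWED).items():
        print(f"{token:20} {'allowed' if allowed else 'blocked'}")
```

If you edit the lists in the template, update BLOCKED and ALLOWED to match and rerun the check.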

robots.txt must be served from the root of the host it governs, conventionally alongside sitemap.xml. How you publish it depends on your platform:

  • Static sites (Astro, Docusaurus, and others) — place robots.txt in the directory served at the root, such as public/.
  • Mintlify — add the file through your project’s static asset configuration.
  • Subdomains are separate — a docs subdomain (docs.example.com) needs its own robots.txt; the root domain’s file does not apply to it.
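
To illustrate the root and subdomain rules above, this minimal sketch derives the robots.txt location that applies to a given page. The robots_url_for helper is a hypothetical name introduced for illustration; it keeps only the scheme and host of the page URL and replaces the path with /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url):
    """Return the robots.txt URL for the host that serves `page_url`."""
    parts = urlsplit(page_url)
    # Same scheme and host, fixed root path, no query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url_for("https://docs.example.com/guides/intro"))
    # prints https://docs.example.com/robots.txt
    print(robots_url_for("https://example.com/blog/"))
    # prints https://example.com/robots.txt
```

The two results differ because each host is checked on its own, which is why a docs subdomain needs its own file.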

After publishing, verify the result:

  1. Confirm the file is served at the root:

    curl https://docs.example.com/robots.txt
  2. Confirm your block and allow rules read as you intend, and that the tokens match the current values in the bot reference.

  3. After deploying, monitor your logs to confirm the crawlers you blocked stop appearing — and to catch any that ignore the rules.
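
The log check in the last step can be sketched as a small script. It assumes your server writes the common "combined" access log format, where the user agent is the final quoted field; adjust the pattern if your format differs. The count_blocked_hits helper and the sample lines are made up for illustration. Requests for /robots.txt itself are skipped, since a compliant crawler still has to fetch that file. Google-Extended and Applebot-Extended are left out because they are robots.txt control tokens rather than crawlers that send their own requests, so they should not show up as user agents.

```python
import re
from collections import Counter

# Blocked crawlers that send their own requests (see the template in this guide).
BLOCKED_TOKENS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot",
    "Bytespider", "meta-externalagent", "cohere-ai",
]

# Assumed combined log format: ... "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_blocked_hits(log_lines, tokens=BLOCKED_TOKENS):
    """Count requests per blocked token, ignoring fetches of /robots.txt."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # line is not in the assumed format
        if match.group("path").split("?")[0] == "/robots.txt":
            continue  # fetching robots.txt is expected behaviour
        user_agent = match.group("ua").lower()
        for token in tokens:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


# Hypothetical sample lines, not real traffic.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:02 +0000] "GET /guides/intro HTTP/1.1" 200 8841 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET /guides/intro HTTP/1.1" 200 8841 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
    '192.0.2.9 - - [01/Jan/2025:00:00:04 +0000] "GET /api HTTP/1.1" 200 1200 "-" "CCBot/2.0"',
]

if __name__ == "__main__":
    for token, count in count_blocked_hits(SAMPLE_LINES).items():
        print(f"{token}: {count} request(s) after the block")
```

Substring matching on the user agent is deliberately loose, and user-agent strings can be spoofed, so a hit is a prompt to investigate rather than proof of a violation.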