robots.txt for AI crawlers: GPTBot, ClaudeBot and more
Which AI crawlers exist, which to let in and how to set up your robots.txt so AI can cite you. With ready-to-copy templates.

In the article on llms.txt we saw that the file doesn't control access. The file that does is robots.txt: that's where you decide which AI crawlers can read your site, and that decision directly affects whether AI can cite you.
In short, AI crawlers come in three types: training (GPTBot, ClaudeBot, CCBot, Google-Extended), search (OAI-SearchBot, Claude-SearchBot, PerplexityBot) and user-triggered fetch (ChatGPT-User, Claude-User, Perplexity-User). The golden rule: if you want to appear in AI answers, let the search and user agents in; blocking them removes you from citations. And note that robots.txt is voluntary: some bots ignore it, and against those only a server-level block protects you.
robots.txt vs llms.txt
Quickly, so you don't mix them up: llms.txt is a content guide (it suggests what to read); robots.txt is access control (it allows or blocks bots). If you want to decide who gets in, robots.txt is the place.
The three types of AI crawler
Not all AI bots do the same thing:
- Training: they collect content to train models (GPTBot, ClaudeBot, CCBot, Google-Extended).
- Search: they index to answer in real time and are the ones that generate citations (OAI-SearchBot, Claude-SearchBot, PerplexityBot).
- User-triggered fetch: they fetch a specific page when a user asks (ChatGPT-User, Claude-User, Perplexity-User).
Who's who (2026)
The user agents that matter:
- OpenAI: GPTBot (training), OAI-SearchBot (search), ChatGPT-User (on user request).
- Anthropic: ClaudeBot (training), Claude-SearchBot (search), Claude-User (on request).
- Perplexity: PerplexityBot (index), Perplexity-User (on request).
- Google: Google-Extended (controls use for Gemini without affecting Google ranking), Googlebot (search).
- Apple: Applebot-Extended (Apple Intelligence training).
- Common Crawl: CCBot (a dataset used to train many models).
- Meta: Meta-ExternalAgent and FacebookBot.
- ByteDance: Bytespider (famous for ignoring robots.txt).
The key decision: block or let in?
The practical rule: if you want to appear in AI answers, let the search and user bots in. Blocking them removes you from citations, and the effect is hard to reverse because AI systems cache what they have already read. Blocking the training bots is a legitimate intellectual-property choice, but it doesn't give you visibility, and your content may already be in earlier datasets. For a brand that wants to be cited, blocking should be a deliberate trade-off, not the default.
Recommended robots.txt if you want visibility
If your goal is for AI to cite you, let the good bots in and reserve blocking for the problematic ones:
# Allow OpenAI
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
# Allow Anthropic
User-agent: ClaudeBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
# Allow Perplexity
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
# Allow Google (Gemini)
User-agent: Google-Extended
Allow: /
# Block the scraper that ignores rules
User-agent: Bytespider
Disallow: /
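Before publishing a template like this, it can help to check that it parses the way you expect. The sketch below uses Python's standard `urllib.robotparser` on an abridged copy of the rules. The `allowed_bots` helper and the `example.com` URL are illustrative names, not part of any tool mentioned here, and a real crawler's parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the visibility template (assumption: your real file has more groups).
RULES = """\
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Bytespider
Disallow: /
"""


def allowed_bots(robots_txt, bots, url="https://example.com/"):
    """Return {bot: True/False} according to the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    report = allowed_bots(RULES, ["GPTBot", "OAI-SearchBot", "Bytespider"])
    for bot, ok in report.items():
        print(f"{bot}: {'allowed' if ok else 'blocked'}")
```

A check like this also tends to catch formatting slips: if the line break between a `User-agent` line and its rule goes missing, the bot you meant to block is reported as allowed.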
Selective option: visibility without training
If you want to appear in answers but not contribute to model training, block the training bots and keep the search and user ones:
# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow search and user-triggered fetch
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
One warning: don't block ChatGPT-User. It only fetches a page when a user explicitly asks for it, so blocking it breaks that request.
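Both templates follow the same pattern: one group per bot, with the rule decided by the bot's category. As a minimal sketch, assuming the three categories described in this article, you could generate the file from a policy instead of editing it by hand. `CRAWLERS` and `build_robots_txt` are hypothetical names for illustration.

```python
# Crawler categories as listed in this article (extend as needed).
CRAWLERS = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def build_robots_txt(blocked_categories):
    """Disallow every bot in the blocked categories and allow the rest."""
    groups = []
    for category, bots in CRAWLERS.items():
        rule = "Disallow: /" if category in blocked_categories else "Allow: /"
        for bot in bots:
            groups.append(f"User-agent: {bot}\n{rule}")
    # A blank line between groups keeps the file easy to read.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Selective option: block training, keep search and user-triggered fetch.
    print(build_robots_txt({"training"}))
```

Calling `build_robots_txt(set())` gives the all-allowed variant; the problematic scrapers you want to block regardless would still need their own `Disallow` group.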
Heads up: robots.txt is voluntary
robots.txt only works with bots that respect it. Bytespider has been documented ignoring it, and an August 2025 Cloudflare report describes undeclared Perplexity crawlers that ignore it or rotate identities. For those cases robots.txt isn't enough: you need a server-level or WAF block. And check your access logs every so often to see who's really coming in.
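To illustrate what a block outside robots.txt looks like, here is a minimal sketch at the application layer: a WSGI middleware that answers 403 when the User-Agent header contains a blocked token. It stands in for a real server or WAF rule rather than replacing one, `BlockUserAgents` and `demo_app` are hypothetical names, and matching on the user agent cannot catch crawlers that disguise their identity.

```python
# Assumption: tokens to refuse; extend this with what your own logs show.
BLOCKED_TOKENS = ("bytespider",)


class BlockUserAgents:
    """WSGI middleware that returns 403 for user agents containing a blocked token."""

    def __init__(self, app, blocked=BLOCKED_TOKENS):
        self.app = app
        self.blocked = tuple(token.lower() for token in blocked)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else reaches the wrapped application unchanged.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Placeholder application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = BlockUserAgents(demo_app)
```

In production the same idea usually lives in the web server or WAF configuration, where it can be combined with IP-based rules for bots that rotate user agents.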
Common mistakes
- Blocking the search bots and then wondering why AI doesn't cite you.
- Confusing robots.txt (access) with llms.txt (content guide).
- Blocking ChatGPT-User, which breaks the user's explicit request.
- Believing robots.txt stops everyone: non-compliant bots ignore it.
- Thinking that blocking now erases your content from already-trained models; it doesn't.
Frequently asked questions
Does blocking GPTBot remove me from ChatGPT?
Not entirely: GPTBot is for training. To appear in ChatGPT Search, OAI-SearchBot matters more.
What is Google-Extended?
The control to opt out of Gemini training without affecting your position in Google.
Should I block the training bots?
It's an intellectual-property decision; it doesn't give you visibility. If you want citations, let the search bots in.
Does robots.txt stop all bots?
No; some ignore it. For those, you need a server-level block.
How do I know who's visiting me?
By checking your access logs by user agent.
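As a rough sketch of that log check, assuming your server writes the common "combined" log format (where the user agent is the last quoted field), you can count hits per AI crawler. The sample lines and their user-agent strings are invented and simplified for illustration. Google-Extended is left out because it is a robots.txt control token, not a user agent that appears in requests.

```python
import re
from collections import Counter

# Tokens from this article that can appear in a User-Agent header.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot", "Bytespider",
]

# In the combined log format the user agent is the last quoted field on the line.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(lines):
    """Count requests per AI crawler token found in each line's user agent."""
    hits = Counter()
    for line in lines:
        match = LAST_QUOTED.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


if __name__ == "__main__":
    # Invented sample lines; point this at your real access log instead.
    sample = [
        '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.9 - - [10/Jan/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    ]
    for token, count in count_ai_hits(sample).most_common():
        print(token, count)
```

Keep in mind that this only sees bots that identify themselves; crawlers that hide behind a generic browser user agent will not show up in these counts.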

Written by
Federico Ergang
Cliro cofounder & CEO
Federico Ergang is cofounder and CEO of Cliro, the AI visibility and GEO platform for Latin America.
