robots.txt for AI crawlers: GPTBot, ClaudeBot and more

Which AI crawlers exist, which to let in and how to set up your robots.txt so AI can cite you. With ready-to-copy templates.

In the article on llms.txt we saw that the file doesn't control access. The file that does is robots.txt: that's where you decide which AI crawlers can read your site. That decision directly affects whether AI can cite you.

robots.txt is where you control which AI crawlers can read your site. There are three types: training (GPTBot, ClaudeBot, CCBot, Google-Extended), search (OAI-SearchBot, Claude-SearchBot, PerplexityBot) and user-triggered fetch (ChatGPT-User, Claude-User, Perplexity-User). The golden rule: if you want to appear in AI answers, let the search and user agents in; blocking them removes you from citations. One caveat: robots.txt is voluntary, so some bots ignore it, and against those only a server-level block protects you.

robots.txt vs llms.txt

Quickly, so you don't mix them up: llms.txt is a content guide (it suggests what to read); robots.txt is access control (it allows or blocks bots). If what you want is to decide who gets in, it's here.

The three types of AI crawler

Not all AI bots do the same thing:

  • Training: they collect content to train models (GPTBot, ClaudeBot, CCBot, Google-Extended).
  • Search: they index to answer in real time and are the ones that generate citations (OAI-SearchBot, Claude-SearchBot, PerplexityBot).
  • User-triggered fetch: they fetch a specific page when a user asks (ChatGPT-User, Claude-User, Perplexity-User).

Who's who (2026)

The user agents that matter:

  • OpenAI: GPTBot (training), OAI-SearchBot (search), ChatGPT-User (on user request).
  • Anthropic: ClaudeBot (training), Claude-SearchBot (search), Claude-User (on request).
  • Perplexity: PerplexityBot (search), Perplexity-User (on request).
  • Google: Google-Extended (controls use for Gemini without affecting Google ranking), Googlebot (search).
  • Apple: Applebot-Extended (Apple Intelligence training).
  • Common Crawl: CCBot (a dataset used to train many models).
  • Meta: Meta-ExternalAgent and FacebookBot.
  • ByteDance: Bytespider (famous for ignoring robots.txt).
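
One way to keep this grouping maintainable is to store it as data and generate the rules from it. The sketch below is illustrative only: `CRAWLER_TYPES` and `build_robots_txt` are hypothetical names, and the grouping simply mirrors the lists in this article (Meta-ExternalAgent, FacebookBot and Bytespider are left out because the article doesn't assign them to one of the three types).

```python
# robots.txt tokens grouped by purpose, following the lists in this article.
CRAWLER_TYPES = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Applebot-Extended"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
}


def build_robots_txt(blocked_types):
    """Return robots.txt text that blocks the given types and allows the rest."""
    lines = []
    for crawler_type, tokens in CRAWLER_TYPES.items():
        rule = "Disallow: /" if crawler_type in blocked_types else "Allow: /"
        lines.append(f"# {crawler_type}")
        for token in tokens:
            lines.append(f"User-agent: {token}")
            lines.append(rule)
        lines.append("")  # blank line between groups
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: opt out of training, stay visible to search and user fetches.
    print(build_robots_txt(blocked_types={"training"}))
```

Passing an empty set allows every listed token; passing `{"training"}` blocks only the training tokens.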

The key decision: block or let in?

The practical rule: if you want to appear in AI answers, let the search and user bots in. Blocking them removes you from citations, and the effect can be slow to reverse because these systems cache what they have already indexed. Blocking the training bots is a legitimate intellectual-property choice, but it doesn't give you visibility, and your content may already be in earlier datasets. For a brand that wants to be cited, blocking is a deliberate trade-off, not the default.

If your goal is for AI to cite you, let the good bots in and reserve blocking for the problematic ones:

# Allow OpenAI
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /

# Allow Anthropic
User-agent: ClaudeBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /

# Allow Perplexity
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /

# Allow Google (Gemini)
User-agent: Google-Extended
Allow: /

# Block the scraper that ignores rules
User-agent: Bytespider
Disallow: /

Selective option: visibility without training

If you want to appear in answers but not contribute to model training, block the training bots and keep the search and user ones:

# Block training
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /

# Allow search and user-triggered fetch
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /

One warning: don't block ChatGPT-User; doing so breaks a page fetch that the user explicitly requested.
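
Before deploying rules like these, it can help to check that they say what you think they say. Here is a minimal sketch using Python's standard `urllib.robotparser`, run against a shortened copy of the selective rules above. The `check_policy` helper and the example.com URL are made up for illustration, and this is only one parser's reading of the file: real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the selective rules: training blocked, search and user allowed.
SELECTIVE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
"""


def check_policy(robots_txt, agents, url="https://example.com/any-page"):
    """Return {agent: True/False} for whether each agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]
    for agent, allowed in check_policy(SELECTIVE_ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

An agent with no matching group and no `User-agent: *` fallback is treated as allowed, which is why search and user bots you forget to list still get in.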

Heads up: robots.txt is voluntary

robots.txt only works with bots that respect it. Bytespider and, according to an August 2025 Cloudflare report, undeclared Perplexity crawlers have been documented ignoring it or rotating identities. For those cases robots.txt isn't enough: you need a server-level or WAF block. And check your access logs every so often to see who's really coming in.
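
What a server-level block looks like depends on your stack. As one hedged illustration, assuming a Python site served over WSGI, a small middleware can refuse requests by User-Agent. The names `block_bots`, `BLOCKED_UA_SUBSTRINGS` and the placeholder `site` app are hypothetical. Matching on User-Agent only stops bots that declare themselves, so crawlers that rotate identities would need IP-based or WAF rules instead.

```python
# Assumption: adjust this list to what your own access logs show.
BLOCKED_UA_SUBSTRINGS = ("bytespider",)


def block_bots(app, blocked=BLOCKED_UA_SUBSTRINGS):
    """Wrap a WSGI app and answer 403 when the User-Agent matches the blocklist."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    """Placeholder application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


# The wrapped app is what you would hand to your WSGI server.
app = block_bots(site)
```

The same idea carries over to a reverse proxy or WAF rule: the refusal happens on your side, so it doesn't depend on the bot's cooperation.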

Common mistakes

  • Blocking the search bots and then wondering why AI doesn't cite you.
  • Confusing robots.txt (access) with llms.txt (content guide).
  • Blocking ChatGPT-User, which breaks the user's explicit request.
  • Believing robots.txt stops everyone: non-compliant bots ignore it.
  • Thinking that blocking now erases your content from already-trained models; it doesn't.

Frequently asked questions

Does blocking GPTBot remove me from ChatGPT?

Not entirely: GPTBot is for training. To appear in ChatGPT Search, OAI-SearchBot matters more.

What is Google-Extended?

A robots.txt token that lets you opt out of Gemini training without affecting your position in Google Search.

Should I block the training bots?

It's an intellectual-property decision; it doesn't give you visibility. If you want citations, let the search bots in.

Does robots.txt stop all bots?

No; some ignore it. For those, you need a server-level block.

How do I know who's visiting me?

By checking your access logs by user agent.
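
As a rough sketch of that check, the script below counts hits per AI user agent. It assumes the common "combined" access log format, where the User-Agent is the last quoted field on each line; `count_ai_hits` and `AI_TOKENS` are illustrative names. Google-Extended and Applebot-Extended are left out on the assumption that they are robots.txt control tokens and don't show up as a User-Agent header.

```python
import re
from collections import Counter

# User-agent substrings to look for, taken from the list in this article.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "CCBot", "Meta-ExternalAgent", "FacebookBot", "Bytespider",
]

# In the combined log format the User-Agent is the last quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines):
    """Count requests per AI token across an iterable of access-log lines."""
    hits = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break  # count each request once
    return hits
```

To run it on a real file, pass the open file object, for example `count_ai_hits(open("access.log"))`, where the path is whatever your server writes to. It only reports what bots declare about themselves, so undeclared crawlers won't appear under these names.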

Written by

Federico Ergang

Cliro cofounder & CEO

Federico Ergang is cofounder and CEO of Cliro, the AI visibility and GEO platform for Latin America.
