Learn which AI bots crawl websites (training, search, and user-triggered fetchers), how to identify them, and how robots.txt affects visibility.
AI crawlers fall into three categories: training bots, search bots, and user-triggered fetchers. The table below lists the user-agent tokens most likely to appear in your access logs in 2026, the company behind each one, and what that bot actually does with your content.
| Operator (role) | User-agent token | Purpose |
|---|---|---|
| OpenAI (training) | GPTBot | Collects pages for future GPT model training |
| OpenAI (user fetch) | ChatGPT-User | Fetches a specific URL when a ChatGPT user asks the model to read it |
| OpenAI (search) | OAI-SearchBot | Powers ChatGPT's search feature and in-answer citations |
| Anthropic (training) | ClaudeBot | Collects pages for Claude model training |
| Anthropic (legacy) | Claude-Web | Earlier Anthropic crawler, still seen in some logs |
| Perplexity (search) | PerplexityBot | Main Perplexity crawler that feeds answer citations |
| Perplexity (user fetch) | Perplexity-User | Fetches a URL on behalf of a Perplexity user; per Perplexity's docs, generally ignores robots.txt |
| Google (AI products) | GoogleOther | Google internal and product fetches that sit outside Google Search |
| ByteDance (training) | Bytespider | Collects content for ByteDance AI products, including Doubao |
| Apple (AI training) | Applebot-Extended | Controls whether Apple Intelligence can train on your content |
For the pattern behind this split, and how to choose what to allow, read on.
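Not all AI bots are the same. There are three categories: training bots, which collect pages for model training; search bots, which crawl pages so that AI answers can find and cite them; and user-triggered fetchers, which retrieve a specific URL when a person asks the assistant to read it.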
The distinction matters. Blocking training bots doesn't necessarily make you invisible to AI. Search bots and user fetchers can still find you.
Want the full optimization playbook? See: How to Improve Your AI-Readiness Score.
OpenAI actually runs three different bots, which is smart: it lets you make granular decisions about what you allow.
- GPTBot: the training crawler. This is what collects content to improve their models.
- OAI-SearchBot: powers ChatGPT's search feature. If you want to show up in ChatGPT search results, this is the one that matters.
- ChatGPT-User: fires when a user explicitly asks ChatGPT to "look at this page." It's basically acting on behalf of a human.

Anthropic follows a similar pattern, with separate tokens for separate purposes:

- ClaudeBot: training data collection
- Claude-User: fetches a specific URL when a Claude user asks for it
- Claude-SearchBot: crawls for Claude's search capabilities

Perplexity is interesting because it's search-first: their whole product is about finding and citing sources.

- PerplexityBot: their main search crawler
- Perplexity-User: user-triggered fetches

Heads up: Perplexity's docs say their user-triggered fetcher generally ignores robots.txt. The logic is that if a human asked the AI to look at a page, blocking it would be like blocking a human with a browser. Controversial, but that's their stance.
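To make the training / search / user-fetch split concrete, here is a minimal Python sketch that labels a User-Agent header by the tokens named above. The `classify_user_agent` helper, the category labels, and the sample User-Agent strings are illustrative assumptions, not anything these companies publish; real headers are longer and change over time.

```python
from typing import Optional

# Token-to-category map built only from the tokens named in this article.
AI_BOT_CATEGORIES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "Perplexity-User": "user-fetch",
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the category of the first known token found in the header, else None."""
    lowered = user_agent.lower()
    for token, category in AI_BOT_CATEGORIES.items():
        if token.lower() in lowered:
            return category
    return None


# Simplified, made-up User-Agent strings for illustration only.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))         # training
print(classify_user_agent("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))  # search
print(classify_user_agent("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))   # user-fetch
print(classify_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))  # None
```

A User-Agent header can be spoofed, so treat this as a first-pass label rather than proof of who sent the request.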
This one confuses people. Google-Extended is not Googlebot. It's a separate token specifically for Gemini (Google's AI) training and grounding.
Blocking Google-Extended does not affect your Google Search rankings. Google has been explicit about this. It only controls whether your content gets used for Gemini training and grounding.
Technically, they're pretty similar. They make HTTP requests, parse HTML, follow links. The difference is what happens after they crawl:
One thing we've noticed: AI crawlers are less predictable than Googlebot. They don't follow a neat schedule. Some sites see them daily, others weekly, others rarely.
We don't have insider knowledge of their ranking algorithms (nobody outside these companies does), but based on what we've observed and what these companies have said publicly:
Content that works:
Technical stuff that helps:
Trust signals:
This is what we recommend for most public sites. Allow everything:
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Google-Extended
Allow: /
You can also take a middle-ground approach: allow search bots but block training bots. That way you can appear in AI search results without your content being used to train models.
Your call. Here's how:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
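One way to sanity-check this policy before deploying it is Python's standard-library `urllib.robotparser`. The sketch below parses the same three groups from a string and asks which agents may fetch a page. The `check_agents` helper and the example.com URL are illustrative assumptions, and each real crawler uses its own parser, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# The middle-ground policy: block training tokens, say nothing about search bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def check_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each agent token to whether the given robots.txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = check_agents(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"],
    "https://example.com/blog/post",  # hypothetical URL
)
print(result)
# {'GPTBot': False, 'ClaudeBot': False, 'OAI-SearchBot': True, 'PerplexityBot': True}
```

Agents with no matching group and no `User-agent: *` group are allowed by default, which is why the search bots pass without any explicit `Allow` line.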
Most sites have some pages that should be public and others that shouldn't. Here's an example:
User-agent: GPTBot
Allow: /blog/
Allow: /products/
Disallow: /admin/
Disallow: /private/
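Path-level rules like these can be checked the same way. This is a small sketch using Python's standard-library `urllib.robotparser`; the example.com URLs are hypothetical. Note that this parser applies the first rule that matches, while RFC 9309 specifies longest-match precedence. The two agree here because none of the paths overlap.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the example above, held in a string for testing.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /blog/
Allow: /products/
Disallow: /admin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, one per interesting case.
for path in ["/blog/post", "/admin/users", "/private/report.pdf", "/about"]:
    allowed = parser.can_fetch("GPTBot", "https://example.com" + path)
    print(path, allowed)
# /blog/post True
# /admin/users False
# /private/report.pdf False
# /about True
```

`/about` matches no rule, so it is allowed. In a group like this one the `Allow` lines mostly document intent; they only change behavior when they carve an exception out of a broader `Disallow`.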
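Important: robots.txt controls crawling, not indexing. If a URL is linked from somewhere else, search engines might still index it even if you've blocked crawling. For truly private content, use authentication. To keep a public page out of an index, use a noindex tag and leave that page crawlable: a crawler that is blocked by robots.txt never fetches the page, so it never sees the tag.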
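If you rely on noindex, it helps to confirm the directive is actually being served. The following is a rough sketch, using only Python's standard library, that looks for `noindex` in a `<meta name="robots">` tag or an `X-Robots-Tag` response header. The `has_noindex` helper and the sample inputs are hypothetical, and the check ignores bot-specific variants such as a meta tag named after a single crawler or a header value prefixed with a user agent.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html: str, headers: dict[str, str]) -> bool:
    """True if the page body or its response headers carry a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # Header names are case-insensitive, so compare them in lowercase.
    header_value = next(
        (value for name, value in headers.items() if name.lower() == "x-robots-tag"),
        "",
    )
    header_directives = {part.strip().lower() for part in header_value.split(",")}
    return "noindex" in parser.directives or "noindex" in header_directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page, {}))                                  # True
print(has_noindex("<html></html>", {"X-Robots-Tag": "noindex"}))  # True
print(has_noindex("<html></html>", {}))                       # False
```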
We'll be direct: if AI can't read your site, it can't cite you.
When someone asks ChatGPT or Perplexity about your industry, do you show up? If you've blocked their crawlers, probably not. Your competitors who are accessible will get mentioned instead.
That said, there are legitimate reasons to block training crawlers: copyright concerns, licensing issues, competitive intelligence. Just know the tradeoff.
Check your logs. Do you even know which bots are visiting? We've talked to site owners who had no idea GPTBot was hitting them daily.
Keep content fresh. We've noticed AI systems tend to prefer recent content. An article from 2019 is less likely to be cited than one from 2024.
Use structured data. It's not required, but it helps. See our guide: Structured Data for AI.
Test your AI Visibility. We built friendly4AI specifically for this — test your site's AI-readiness and see exactly what AI crawlers see. To understand how LLMs decide which sites to mention, read What Is AI Visibility? and How LLMs Choose Which Websites to Recommend.
OAI-SearchBot is OpenAI's search-specific crawler, separate from the training bot GPTBot. It powers the search feature inside ChatGPT — when a user asks ChatGPT about recent information, OAI-SearchBot fetches the candidate pages and the answer cites them. Blocking OAI-SearchBot removes your site from ChatGPT search results. Allowing it does not, by itself, add your content to GPT training data: that is GPTBot's job.
Block GPTBot only if you have a specific reason — copyright, licensing, exclusivity, or a contractual restriction. For most public sites, blocking GPTBot costs AI visibility: your pages will not be used in model training, which reduces the chance ChatGPT surfaces your brand when answering questions in your topic area. A middle path is to block GPTBot (training) while allowing OAI-SearchBot (search), which keeps you cite-able in ChatGPT without feeding the model.
No. ChatGPT-User is a user-triggered fetcher that fires when a person inside ChatGPT pastes a URL and asks the model to read it. It does not touch Bing's index. Bingbot is Microsoft's separate search crawler, and Copilot runs its own fetcher on top of that — ChatGPT-User blocks affect only on-demand fetches from ChatGPT sessions, not anything indexed by Bing.
Grep your access logs for these User-Agent strings: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User, GoogleOther, Bytespider. Google-Extended and Applebot-Extended will not show up in logs: they are robots.txt control tokens, and the actual fetching happens under the regular Google and Applebot user agents. Most CDNs (Cloudflare, Fastly) and analytics platforms also include built-in bot categorization that separates AI crawlers from human traffic and regular search bots.
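If you would rather count hits than eyeball grep output, a short script does the job. This is a minimal sketch that assumes the combined log format, where the User-Agent is the last quoted field on each line. The `count_ai_bot_hits` helper and the sample log lines are made up for illustration; adjust the regular expression if your server logs a different layout.

```python
import re
from collections import Counter
from typing import Iterable

# Tokens from this article that are sent in the User-Agent header.
# Google-Extended and Applebot-Extended are left out: they are robots.txt
# control tokens, not user agents that appear in request logs.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User",
    "GoogleOther", "Bytespider",
]

# Assumption: combined log format, User-Agent is the last quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_bot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per AI bot token, matching case-insensitively."""
    counts: Counter = Counter()
    for line in lines:
        match = LAST_QUOTED_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


# Made-up log lines using documentation IP ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [12/Jan/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.8 - - [12/Jan/2026:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.2 - - [12/Jan/2026:10:00:09 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"',
    '203.0.113.7 - - [12/Jan/2026:10:01:00 +0000] "GET /products/ HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

hits = count_ai_bot_hits(SAMPLE_LOG)
print(hits.most_common())  # [('GPTBot', 2), ('PerplexityBot', 1)]
```

To run it on a real file, pass the open file object instead of `SAMPLE_LOG`, for example `count_ai_bot_hits(open("access.log"))`, where `access.log` stands in for your own log path.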
Not entirely. You can block training bots (GPTBot, ClaudeBot, Google-Extended) while allowing search bots (OAI-SearchBot, PerplexityBot, Claude-SearchBot). The search bots still find, fetch, and cite your pages when answering user questions. Your content simply will not be used to train the next version of the underlying model.
We've linked to the official sources here so you can verify everything yourself:
This space is evolving fast. New bots appear, policies change, tokens get added. We try to keep this page updated, but always check the official docs for the latest.


