Learn which AI bots crawl websites (training, search, and user-triggered fetchers), how to identify them, and how robots.txt affects visibility.
If you've checked your server logs lately, you've probably noticed some unfamiliar names: GPTBot, ClaudeBot, PerplexityBot. These aren't your parents' search engine crawlers.
We see questions about these bots constantly, so let's break down who they are, what they want, and whether you should let them in.
Not all AI bots are the same. There are three categories:

Training crawlers — collect content to train and improve AI models.
Search crawlers — index content so it can appear, and be cited, in AI search results.
User-triggered fetchers — retrieve a specific page when a human asks the AI to look at it.
The distinction matters. Blocking training bots doesn't necessarily make you invisible to AI. Search bots and user fetchers can still find you.
Want the full optimization playbook? See: How to Improve Your AI-Readiness Score.
OpenAI actually runs three different bots, which is smart: it lets you make granular decisions about what you allow.
GPTBot — the training crawler. This is what collects content to improve their models.
OAI-SearchBot — powers ChatGPT's search feature. If you want to show up in ChatGPT search results, this is the one that matters.
ChatGPT-User — fires when a user explicitly asks ChatGPT to "look at this page." It's basically acting on behalf of a human.

Anthropic follows a similar pattern, with separate tokens for separate purposes:

ClaudeBot — training data collection
Claude-User — when a Claude user asks it to fetch a specific URL
Claude-SearchBot — crawls for Claude's search capabilities

Perplexity is interesting because it's search-first: their whole product is about finding and citing sources.

PerplexityBot — their main search crawler
Perplexity-User — user-triggered fetches

Heads up: Perplexity's docs say their user-triggered fetcher generally ignores robots.txt. The logic is that if a human asked the AI to look at a page, blocking it would be like blocking a human with a browser. Controversial, but that's their stance.
This one confuses people. Google-Extended is not Googlebot. It's a separate token specifically for Gemini (Google's AI) training and grounding.
Blocking Google-Extended does not affect your Google Search rankings. Google has been explicit about this. It only controls whether your content gets used in Gemini's training data.
Technically, AI crawlers and traditional search crawlers are pretty similar. They make HTTP requests, parse HTML, and follow links. The difference is what happens after the crawl: training bots feed your content into model training, search bots index it so it can be cited in AI answers, and user-triggered fetchers pull a page once to answer a single question.
One thing we've noticed: AI crawlers are less predictable than Googlebot. They don't follow a neat schedule. Some sites see them daily, others weekly, others rarely.
We don't have insider knowledge of their ranking algorithms (nobody outside these companies does), but based on what we've observed and what these companies have said publicly:
Content that works:
Technical stuff that helps:
Trust signals:
This is what we recommend for most public sites. Allow everything:
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Google-Extended
Allow: /
You can also take a middle-ground approach: allow search bots but block training bots. That way you can appear in AI search results without your content being used to train models.
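A middle-ground robots.txt built from the tokens covered in this article might look like this (a sketch; check each vendor's docs for the current token names before deploying):

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

With this setup, ChatGPT search, Perplexity, and Claude search can still discover and cite your pages, while the training crawlers are turned away.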
Your call. Here's how:
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
Most sites have some pages that should be public and others that shouldn't. Here's an example:
User-agent: GPTBot
Allow: /blog/
Allow: /products/
Disallow: /admin/
Disallow: /private/
Important: robots.txt controls crawling, not indexing. If a URL is linked from somewhere else, search engines might still index it even if you've blocked crawling. For truly private content, use authentication or noindex tags.
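For example, a noindex directive can be delivered either as a meta tag in the page or as an HTTP response header (the header form works for non-HTML files like PDFs):

```
<!-- In the page's <head> -->
<meta name="robots" content="noindex">
```

```
# As an HTTP response header, set by your server
X-Robots-Tag: noindex
```

Note the catch: for a crawler to see a noindex directive, it has to be allowed to fetch the page, so don't combine noindex with a robots.txt Disallow for the same URL.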
We'll be direct: if AI can't read your site, it can't cite you.
When someone asks ChatGPT or Perplexity about your industry, do you show up? If you've blocked their crawlers, probably not. Your competitors who are accessible will get mentioned instead.
That said, there are legitimate reasons to block training crawlers—copyright concerns, licensing issues, competitive intelligence. Just know the tradeoff.
Check your logs. Do you even know which bots are visiting? We've talked to site owners who had no idea GPTBot was hitting them daily.
Keep content fresh. We've noticed AI systems tend to prefer recent content. An article from 2019 is less likely to be cited than one from 2024.
Use structured data. It's not required, but it helps. See our guide: Structured Data for AI.
Test your AI Visibility. We built friendly4AI specifically for this—scan your site and see what AI crawlers see. To understand how LLMs decide which sites to mention, read What Is AI Visibility? and How LLMs Choose Which Websites to Recommend.
Grep for these User-Agent strings: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User, Google-Extended.
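A quick way to do that count, sketched against a tiny sample log (in practice, point LOG at your real access log, e.g. /var/log/nginx/access.log — the path and log format here are illustrative assumptions):

```shell
# Create a small sample log to demonstrate; replace with your real log path.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [01/Jan/2025] "GET /blog/ HTTP/1.1" 200 "-" "Mozilla/5.0 AppleWebKit; compatible; GPTBot/1.2"
5.6.7.8 - - [01/Jan/2025] "GET /products/ HTTP/1.1" 200 "-" "Mozilla/5.0; PerplexityBot/1.0"
9.9.9.9 - - [01/Jan/2025] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (regular browser)"
EOF

# Count requests per AI bot user-agent token.
for bot in GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-SearchBot \
           Claude-User PerplexityBot Perplexity-User Google-Extended; do
  printf '%s: %s\n' "$bot" "$(grep -c "$bot" "$LOG")"
done
# Prints one line per token, e.g. "GPTBot: 1" and "PerplexityBot: 1" here.
```

Run against a real log, this gives a rough per-bot request count; for anything more serious, parse the user-agent field properly rather than grepping whole lines.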
If you're using a CDN or analytics platform, many now have built-in bot detection that separates AI crawlers from regular traffic.
Not entirely. You can block training bots (GPTBot, ClaudeBot) while allowing search bots (OAI-SearchBot, PerplexityBot). The search bots can still find and cite you—your content just won't be used to train models.
We've linked to the official sources here so you can verify everything yourself:
This space is evolving fast. New bots appear, policies change, tokens get added. We try to keep this page updated, but always check the official docs for the latest.