AI Crawler Directory

18 crawlers from 12 companies — last updated May 2026

Every major AI company operates web crawlers that visit your site. Some collect training data for foundation models. Others power search features or respond to user requests. Understanding which is which lets you make informed robots.txt decisions — blocking training without losing visibility in AI-powered search.

USER-AGENT	COMPANY	TYPE	ROBOTS.TXT COMPLIANCE
GPTBot	OpenAI	AI Training	✓ Respects
ChatGPT-User	OpenAI	User-Triggered Fetch	✓ Respects
OAI-SearchBot	OpenAI	AI Search Index	✓ Respects
ClaudeBot	Anthropic	AI Training	✓ Respects
Claude-User	Anthropic	User-Triggered Fetch	✓ Respects
Claude-SearchBot	Anthropic	AI Search Index	✓ Respects
anthropic-ai	Anthropic	Legacy / Deprecated	✓ Respects
Google-Extended	Google	AI Training	✓ Respects
Applebot-Extended	Apple	AI Training	✓ Respects
Meta-ExternalAgent	Meta	AI Training	⚠ Partial
FacebookBot	Meta	AI Feature Indexing	✓ Respects
PerplexityBot	Perplexity AI	AI Search Index	✓ Respects
CCBot	Common Crawl	Open Dataset	✓ Respects
Bytespider	ByteDance	AI Training	⚠ Partial
Amazonbot	Amazon	AI Feature Indexing	✓ Respects
Diffbot	Diffbot	Open Dataset	✓ Respects
DeepSeekBot	DeepSeek	AI Training	✓ Respects
cohere-ai	Cohere	AI Training	✓ Respects

OpenAI

GPTBot AI Training

Collects training data for GPT models

ChatGPT-User User-Triggered Fetch

Live page fetches triggered by ChatGPT users

OAI-SearchBot AI Search Index

Indexes content for ChatGPT Search citations

Anthropic

ClaudeBot AI Training

Collects training data for Claude models

Claude-User User-Triggered Fetch

Live page fetches triggered by Claude users

Claude-SearchBot AI Search Index

Indexes content for Claude search results

anthropic-ai Legacy / Deprecated

Deprecated training crawler identifier

Google

Google-Extended AI Training

Opt-out for Gemini and Vertex AI training

Apple

Applebot-Extended AI Training

Opt-out for Apple Intelligence training

Perplexity AI

PerplexityBot AI Search Index

Indexes content for Perplexity AI answers

Common Crawl

CCBot Open Dataset

Open web dataset used by many downstream AI models

ByteDance

Bytespider AI Training

Training data crawler attributed to ByteDance

Amazon

Amazonbot AI Feature Indexing

Indexes content for Alexa and Amazon assistant features

Diffbot

Diffbot Open Dataset

Knowledge graph extraction licensed by AI companies

DeepSeek

DeepSeekBot AI Training

Collects training data for DeepSeek AI models

Cohere

cohere-ai AI Training

Training data for Cohere's language models

Understanding AI Web Crawlers

AI web crawlers are automated programs that visit websites to collect content. Unlike traditional search engine crawlers (Googlebot, Bingbot) that build search indexes, AI crawlers serve several distinct purposes:

Training Crawlers

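These crawlers collect web content to build training datasets for foundation models. GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), and Meta-ExternalAgent (Meta) all fall into this category. Google-Extended is a robots.txt control token rather than a separate crawler: it never appears as a User-Agent header, but Google's existing crawlers honor it. Blocking these user agents keeps your content out of future training runs, but it does not remove content that has already been collected or used in training.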

Search and Retrieval Crawlers

OAI-SearchBot, Claude-SearchBot, and PerplexityBot index content so it can appear as cited sources in AI-powered search products. Blocking these crawlers removes your site from those products' search results — a significant trade-off if AI search drives traffic to your site.

User-Triggered Fetchers

ChatGPT-User and Claude-User activate only when a human asks the AI assistant to read a specific URL. They are not autonomous crawlers: they fetch one page at a time in response to user requests. Blocking them prevents the assistant from reading, and therefore citing, your page when a user explicitly asks it to.
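
To see which of these three categories is actually hitting your site, you can match the User-Agent header of each logged request against the tokens in this directory. The following is a minimal sketch, not a definitive implementation: the function name `classify_user_agent` and the sample header strings are hypothetical, the mapping covers only a subset of the table, and case-insensitive substring matching is an assumption about how the tokens appear in real headers. Google-Extended is left out because it is a robots.txt token and does not show up in request headers.

```python
# Subset of the directory table: user-agent token -> crawler type.
CRAWLER_TYPES = {
    "GPTBot": "AI Training",
    "ChatGPT-User": "User-Triggered Fetch",
    "OAI-SearchBot": "AI Search Index",
    "ClaudeBot": "AI Training",
    "Claude-User": "User-Triggered Fetch",
    "Claude-SearchBot": "AI Search Index",
    "PerplexityBot": "AI Search Index",
    "Meta-ExternalAgent": "AI Training",
    "CCBot": "Open Dataset",
}


def classify_user_agent(header):
    """Return (token, type) for a known AI crawler, or None if no token matches.

    Assumption: the token appears somewhere in the User-Agent header,
    so a case-insensitive substring check is enough for a first pass.
    """
    lowered = header.lower()
    for token, crawler_type in CRAWLER_TYPES.items():
        if token.lower() in lowered:
            return token, crawler_type
    return None


# Hypothetical header values, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; Claude-User/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; rv:120.0) Gecko/20100101 Firefox/120.0",
]
for sample in samples:
    print(sample, "->", classify_user_agent(sample))
```

User-Agent headers can be spoofed, so treat the result as a hint for log analysis rather than proof of who sent the request.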

The Recommended Approach

Most publishers who want to opt out of AI training while staying visible in AI search use this pattern:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /

# Allow search + user-triggered fetchers (keep visibility)
# OAI-SearchBot, ChatGPT-User, Claude-User,
# Claude-SearchBot, PerplexityBot — leave unblocked
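
Before deploying a file like this, it can help to check that it blocks and allows the user agents you expect. The sketch below feeds the rules above to `urllib.robotparser` from Python's standard library. The helper name `allowed_agents` and the `example.com` URL are placeholders, and Python's parser is only an approximation of how each crawler interprets robots.txt, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The training-crawler rules from the pattern above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/article"):
    """Map each user-agent token to True (may fetch url) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "CCBot",
     "OAI-SearchBot", "ChatGPT-User", "Claude-User",
     "Claude-SearchBot", "PerplexityBot", "Googlebot"],
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Because the file has no `User-agent: *` group, any agent without its own group, including the search and user-triggered fetchers and Googlebot, falls through to the default of being allowed.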

For a ready-to-use template, try our Block AI Crawlers guide or generate a custom robots.txt with the Robots.txt Generator.

AI Crawler FAQ

How many AI crawlers are there in 2026?

The ai-robots-txt community project tracks over 100 user-agent strings associated with AI operations. This directory covers the 18 most impactful crawlers from major AI companies that website operators should know about.

Does blocking AI crawlers affect my Google Search rankings?

No. AI crawlers use separate user-agent strings from search engine crawlers. Blocking GPTBot, ClaudeBot, or Google-Extended does not block Googlebot or Bingbot, and it does not change the traditional search indexes those crawlers build. Google Search crawls and ranks pages using Googlebot, and Google-Extended controls only AI training use.

Can I block AI training but stay visible in ChatGPT and Claude?

Yes. Block the training crawlers (GPTBot, ClaudeBot) while leaving the search and user-triggered crawlers (OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot) allowed. robots.txt rules are matched per user agent, so each one can be blocked or allowed independently of the others.
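
As a rough illustration of that answer, the sketch below builds a robots.txt from a list of (user agent, type) pairs and emits a Disallow group only for the types you choose to block. The function name `build_robots_txt` is hypothetical, and the list is limited to the six OpenAI and Anthropic agents named above.

```python
# The six agents named in the answer above, with their types from the directory.
CRAWLERS = [
    ("GPTBot", "AI Training"),
    ("ChatGPT-User", "User-Triggered Fetch"),
    ("OAI-SearchBot", "AI Search Index"),
    ("ClaudeBot", "AI Training"),
    ("Claude-User", "User-Triggered Fetch"),
    ("Claude-SearchBot", "AI Search Index"),
]


def build_robots_txt(crawlers, blocked_types):
    """Return robots.txt text that disallows every crawler of a blocked type.

    Crawlers whose type is not in blocked_types get no group at all,
    which leaves them allowed by default.
    """
    groups = [
        f"User-agent: {token}\nDisallow: /"
        for token, crawler_type in crawlers
        if crawler_type in blocked_types
    ]
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(CRAWLERS, {"AI Training"})
print(robots_txt)
```

Passing a different set, for example one that also contains "AI Search Index", would block the search crawlers as well, with the visibility trade-off described earlier.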