Known AI Crawlers — AgentsPop

Full crawler reference

User-Agent	Company	Purpose	Docs
GPTBot	OpenAI	ChatGPT training	docs ↗
OAI-SearchBot	OpenAI	SearchGPT retrieval	—
ClaudeBot	Anthropic	Claude training/RAG	docs ↗
Claude-Web	Anthropic	Claude.ai web search	docs ↗
Google-Extended	Google	Gemini training opt-out	docs ↗
PerplexityBot	Perplexity	Web retrieval	docs ↗
Meta-ExternalAgent	Meta	Meta AI training	—
Bytespider	ByteDance	TikTok/Doubao AI training	—
CCBot	Common Crawl	Open training datasets	docs ↗
Applebot	Apple	Siri/Spotlight/Suggestions	docs ↗
Applebot-Extended	Apple	Apple Intelligence opt-out	docs ↗
cohere-ai	Cohere	Model training	—
AI2Bot	Allen AI	Scientific AI research	docs ↗
YouBot	You.com	AI search	—
DuckAssistBot	DuckDuckGo	AI answers	—
Diffbot	Diffbot	Knowledge graph + training	docs ↗
Amazonbot	Amazon	Alexa / AI Assistants	docs ↗
PetalBot	Huawei	Petal Search AI	—
ImagesiftBot	ImageSift	Image training datasets	—

How to block AI crawlers in robots.txt

Add entries to your robots.txt (one block per crawler you want to restrict):

# Block OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Block Anthropic training crawler
User-agent: ClaudeBot
Disallow: /

# Opt out of Google Gemini training (Google-Extended is a robots.txt control token)
User-agent: Google-Extended
Disallow: /

# Block ByteDance / TikTok crawler
User-agent: Bytespider
Disallow: /

# Block Common Crawl (used by many open training datasets)
User-agent: CCBot
Disallow: /
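
To sanity-check rules like the ones above before deploying them, a minimal sketch using Python's standard-library urllib.robotparser may help. The `ROBOTS_TXT` constant, the `is_allowed` helper, and the example.com URL are illustrative assumptions, not part of AgentsPop.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the robots.txt example above (comments omitted).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
"""


def is_allowed(user_agent: str,
               url: str = "https://example.com/page",  # placeholder URL
               robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # This sketch passes the bare product token (e.g. "GPTBot"),
    # not a full browser-style User-Agent header.
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "CCBot", "Googlebot"):
        print(agent, "allowed" if is_allowed(agent) else "blocked")
```

Crawlers without a matching block (Googlebot in this sketch) remain allowed, because the file defines no `User-agent: *` rule.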

robots.txt is advisory, not enforced

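robots.txt is a convention, not a technical barrier. Reputable AI companies (OpenAI, Anthropic, Google) state that their crawlers honor it; less reputable operators may not. For enforcement, combine robots.txt with WAF rules in Cloudflare, Vercel, or your CDN that block matching user-agent strings at the edge. Keep in mind that a user-agent string can be spoofed, so this stops only crawlers that identify themselves.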
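
The user-agent matching that such an edge rule performs can be sketched in a framework-agnostic way. The example below is a hypothetical WSGI middleware, not the Cloudflare or Vercel rule syntax; the names `BLOCKED_TOKENS`, `should_block`, and `BlockAICrawlers` are assumptions made for illustration. Google-Extended is left out because it is a robots.txt control token and does not appear as a separate user-agent in HTTP requests.

```python
import re

# Tokens taken from the robots.txt example; adjust to your own policy.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")

# Case-insensitive match anywhere in the User-Agent header.
_BLOCKED_RE = re.compile(
    "|".join(re.escape(token) for token in BLOCKED_TOKENS),
    re.IGNORECASE,
)


def should_block(user_agent):
    """Return True if the header contains one of the blocked tokens."""
    return bool(user_agent) and _BLOCKED_RE.search(user_agent) is not None


class BlockAICrawlers:
    """Hypothetical WSGI middleware that answers 403 to blocked crawlers."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if should_block(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application would look like `app = BlockAICrawlers(app)`. In a real deployment the same check would normally live in the CDN or WAF rule itself, so that blocked requests never reach the origin.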

External reference sources

For the most current and comprehensive bot lists:

Dashboard note

The Known AI Crawlers section on the main dashboard is currently empty because live scraping of bot registry lists is not yet implemented. This blog page serves as the static reference. See RECOMMENDED_FIXES.md for the fix proposal (Fix 11).
