Reference
Known AI Crawlers
20+ documented
Source: AgentsPop + Cloudflare Radar + Barracuda
As of: 2026-03-12

Known AI crawlers with user-agent names for robots.txt configuration.

Full crawler reference

| User-Agent | Company | Purpose | Docs |
| --- | --- | --- | --- |
| GPTBot | OpenAI | ChatGPT training | docs ↗ |
| OAI-SearchBot | OpenAI | SearchGPT retrieval | |
| ClaudeBot | Anthropic | Claude training/RAG | docs ↗ |
| Claude-Web | Anthropic | Claude.ai web search | docs ↗ |
| Google-Extended | Google | Gemini training opt-out | docs ↗ |
| PerplexityBot | Perplexity | Web retrieval | docs ↗ |
| Meta-ExternalAgent | Meta | Meta AI training | |
| Bytespider | ByteDance | TikTok/Doubao AI training | |
| CCBot | Common Crawl | Open training datasets | docs ↗ |
| Applebot | Apple | Siri/Spotlight/Suggestions | docs ↗ |
| Applebot-Extended | Apple | Apple Intelligence opt-out | docs ↗ |
| cohere-ai | Cohere | Model training | |
| AI2Bot | Allen AI | Scientific AI research | docs ↗ |
| YouBot | You.com | AI search | |
| DuckAssistBot | DuckDuckGo | AI answers | |
| Diffbot | Diffbot | Knowledge graph + training | docs ↗ |
| Amazonbot | Amazon | Alexa / AI Assistants | docs ↗ |
| PetalBot | Huawei | Petal Search AI | |
| ImagesiftBot | ImageSift | Image training datasets | |
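The user-agent names in the table are the tokens that robots.txt groups match on. As a rough sketch (not part of the original reference), the Python below builds one `Disallow: /` group per token and then checks the result with the standard library's `urllib.robotparser`. The `BLOCKED_AGENTS` list, the helper names `build_robots_txt` and `is_allowed`, and the `https://example.com/` URL are illustrative assumptions; adjust them to your own site.

```python
from urllib.robotparser import RobotFileParser

# Tokens copied from the table; this selection is only an example.
BLOCKED_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Bytespider", "CCBot"]


def build_robots_txt(agents):
    """Return robots.txt text with one 'Disallow: /' group per user-agent token."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, agent, url="https://example.com/"):
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(BLOCKED_AGENTS)
print(robots_txt)
print("GPTBot allowed:", is_allowed(robots_txt, "GPTBot"))
print("Applebot allowed:", is_allowed(robots_txt, "Applebot"))
```

Checking the generated text with a parser is a cheap way to catch typos in a token before the file is deployed. It only tells you what a compliant crawler would do, not what any given crawler actually does.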

How to block AI crawlers in robots.txt

Add entries to your robots.txt (one block per crawler you want to restrict):

# Block OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Block Anthropic training crawler
User-agent: ClaudeBot
Disallow: /

# Opt out of Google Gemini training
User-agent: Google-Extended
Disallow: /

# Block ByteDance / TikTok crawler
User-agent: Bytespider
Disallow: /

# Block Common Crawl (used by many open training datasets)
User-agent: CCBot
Disallow: /

robots.txt is advisory, not enforced

robots.txt is a convention, not a technical barrier. Reputable AI companies (OpenAI, Anthropic, Google) honor it. Less reputable operators may not. For enforcement, combine robots.txt with WAF rules in Cloudflare, Vercel, or your CDN that block requests by user-agent string at the edge. Because a user-agent string can be spoofed, treat this as a filter rather than a guarantee.
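To illustrate what "block by user-agent string" amounts to, here is a minimal Python sketch of the matching logic a WAF rule or application middleware might apply. It is not the rule syntax of Cloudflare, Vercel, or any other product. The `BLOCKED_TOKENS` list and the `should_block` helper are hypothetical names, and the sample header strings are illustrative only.

```python
import re
from typing import Optional

# Tokens taken from the crawler table. Google-Extended is left out on purpose:
# it is a robots.txt control token and is not sent as a request user-agent.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")

# One case-insensitive pattern that matches any blocked token inside the header.
_BLOCKED_PATTERN = re.compile(
    "|".join(re.escape(token) for token in BLOCKED_TOKENS),
    re.IGNORECASE,
)


def should_block(user_agent: Optional[str]) -> bool:
    """Return True when the User-Agent header contains a blocked crawler token."""
    if not user_agent:
        return False
    return _BLOCKED_PATTERN.search(user_agent) is not None


# Illustrative header values, not captured traffic.
crawler_ua = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/124.0"

print(should_block(crawler_ua))
print(should_block(browser_ua))
```

A request for which `should_block` returns True would typically be answered with a 403 at the edge. Substring matching is used because crawlers embed their token inside a longer header value.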

External reference sources

For the most current and comprehensive bot lists:

Dashboard note

The Known AI Crawlers section on the main dashboard is currently empty because live scraping of bot registry lists is not yet implemented. This blog page serves as the static reference. See RECOMMENDED_FIXES.md for the fix proposal (Fix 11).
