Full crawler reference
| User-Agent | Company | Purpose | Docs |
|---|---|---|---|
| GPTBot | OpenAI | ChatGPT training | docs ↗ |
| OAI-SearchBot | OpenAI | ChatGPT search retrieval | — |
| ClaudeBot | Anthropic | Claude training/RAG | docs ↗ |
| Claude-Web | Anthropic | Claude.ai web search | docs ↗ |
| Google-Extended | Google | Gemini training opt-out | docs ↗ |
| PerplexityBot | Perplexity | Web retrieval | docs ↗ |
| Meta-ExternalAgent | Meta | Meta AI training | — |
| Bytespider | ByteDance | TikTok/Doubao AI training | — |
| CCBot | Common Crawl | Open training datasets | docs ↗ |
| Applebot | Apple | Siri/Spotlight/Suggestions | docs ↗ |
| Applebot-Extended | Apple | Apple Intelligence opt-out | docs ↗ |
| cohere-ai | Cohere | Model training | — |
| AI2Bot | Allen AI | Scientific AI research | docs ↗ |
| YouBot | You.com | AI search | — |
| DuckAssistBot | DuckDuckGo | AI answers | — |
| Diffbot | Diffbot | Knowledge graph + training | docs ↗ |
| Amazonbot | Amazon | Alexa / AI Assistants | docs ↗ |
| PetalBot | Huawei | Petal Search AI | — |
| ImagesiftBot | ImageSift | Image training datasets | — |
How to block AI crawlers in robots.txt
Add entries to your robots.txt (one block per crawler you want to restrict):
    # Block OpenAI training crawler
    User-agent: GPTBot
    Disallow: /

    # Block Anthropic training crawler
    User-agent: ClaudeBot
    Disallow: /

    # Opt out of Google Gemini training
    User-agent: Google-Extended
    Disallow: /

    # Block ByteDance / TikTok crawler
    User-agent: Bytespider
    Disallow: /

    # Block Common Crawl (used by many open training datasets)
    User-agent: CCBot
    Disallow: /
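The entries above can be sanity-checked locally before deploying. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper name `is_allowed` and the `example.com` URL are illustrative assumptions, not part of any crawler's tooling. It only shows how a well-behaved parser would read these rules and says nothing about how a given crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# The rules from the section above, one directive per line.
ROBOTS_TXT = """\
# Block OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Block Anthropic training crawler
User-agent: ClaudeBot
Disallow: /

# Opt out of Google Gemini training
User-agent: Google-Extended
Disallow: /

# Block ByteDance / TikTok crawler
User-agent: Bytespider
Disallow: /

# Block Common Crawl (used by many open training datasets)
User-agent: CCBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant parser would let `user_agent` fetch `url`.

    `is_allowed` is a hypothetical helper written for this example.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; substitute a page from your own site.
    page = "https://example.com/any/page"
    for agent in ("GPTBot", "ClaudeBot", "Google-Extended",
                  "Bytespider", "CCBot", "Googlebot"):
        print(f"{agent}: allowed={is_allowed(agent, page)}")
```

Agents without a matching block (here `Googlebot`) remain allowed, because the file contains no `User-agent: *` fallback.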
robots.txt is advisory, not enforced
robots.txt is a convention, not a technical barrier. Reputable AI companies (OpenAI, Anthropic, Google) state that they honor it; less reputable operators may not. For enforcement, combine robots.txt with WAF rules in Cloudflare, Vercel, or your CDN, blocking by user-agent string at the edge. User-agent strings can be spoofed, so treat edge blocking as a second layer rather than a guarantee.
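Blocking by user-agent string comes down to a case-insensitive substring match on the request header. Real WAF rules are written in each provider's own rule language, so the sketch below is only an application-level stand-in: a small WSGI middleware in Python. The names `BLOCKED_TOKENS`, `is_blocked`, and `BlockAICrawlers` are hypothetical, and the token list is an assumption drawn from the table above.

```python
from typing import Iterable

# Tokens to match inside the User-Agent header (assumed list, taken from the
# table above). Google-Extended is left out: it is a robots.txt control token,
# and Google does not crawl with a separate "Google-Extended" user-agent string.
BLOCKED_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")


def is_blocked(user_agent: str, tokens: Iterable[str] = BLOCKED_TOKENS) -> bool:
    """Case-insensitive substring match of any token against the header."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in tokens)


class BlockAICrawlers:
    """Hypothetical WSGI middleware that answers 403 to matching user agents."""

    def __init__(self, app, tokens: Iterable[str] = BLOCKED_TOKENS):
        self.app = app
        self.tokens = tuple(tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if is_blocked(user_agent, self.tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application is a one-liner, for example `app = BlockAICrawlers(app)`. Because the match relies on a header the client controls, it stops only crawlers that identify themselves honestly.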
External reference sources
For the most current and comprehensive bot lists:
- Cloudflare Radar — Known Bots
- ai-robots-txt on GitHub — community-maintained list
- Barracuda Bad Bot Glossary
Dashboard note
The Known AI Crawlers section on the main dashboard is currently empty because live scraping of bot registry lists is not yet implemented. This blog page serves as the static reference. See RECOMMENDED_FIXES.md for the fix proposal (Fix 11).