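What this list shows
- The exact User-agent string for every major AI crawler, sourced from vendor documentation
- Whether each crawler respects robots.txt, and where the exceptions are
- Each crawler's purpose: AI training, AI search indexing, user-triggered fetch, classical search, or shared dataset
A complete reference list of AI crawlers and user-agents: what they do, who runs them, and whether they respect robots.txt.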
| User-agent | Vendor | Category | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | AI training | Yes |
| OAI-SearchBot | OpenAI | AI search indexing | Yes |
| ChatGPT-User | OpenAI | User-triggered fetch | Yes |
| ClaudeBot | Anthropic | AI training | Yes |
| Claude-SearchBot | Anthropic | AI search indexing | Yes |
| Claude-User | Anthropic | User-triggered fetch | Yes |
| Google-Extended | Google | AI training | Yes |
| GoogleOther | Google | AI training | Yes |
| Googlebot | Google | Search engine | Yes |
| PerplexityBot | Perplexity | AI search indexing | Yes |
| Perplexity-User | Perplexity | User-triggered fetch | No |
| Applebot | Apple | Search engine | Yes |
| Applebot-Extended | Apple | AI training | Yes |
| CCBot | Common Crawl | Shared dataset | Yes |
| Meta-ExternalAgent | Meta | AI training | Yes |
| Meta-ExternalFetcher | Meta | User-triggered fetch | Yes |
| Bytespider | ByteDance | AI training | Partial |
| Amazonbot | Amazon | AI search indexing | Yes |
| DuckAssistBot | DuckDuckGo | AI search indexing | Yes |
| MistralAI-User | Mistral | User-triggered fetch | Yes |
| YouBot | You.com | AI search indexing | Yes |

    # AI crawler block list, generated from clickfrom.ai/tools/ai-crawler-user-agent-list
    # Remove the Disallow line for any crawler you want to allow.

    # OpenAI: Crawls public web pages to improve OpenAI foundation models.
    # Source: https://platform.openai.com/docs/bots
    User-agent: GPTBot
    Disallow: /

    # OpenAI: Indexes web pages so ChatGPT search and SearchGPT can cite them.
    # Source: https://platform.openai.com/docs/bots
    User-agent: OAI-SearchBot
    Disallow: /

    # OpenAI: Fetches a page on the spot when a ChatGPT user asks the assistant about a specific URL.
    # Source: https://platform.openai.com/docs/bots
    User-agent: ChatGPT-User
    Disallow: /

    # Anthropic: Crawls public web pages for Anthropic foundation-model training.
    # Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
    User-agent: ClaudeBot
    Disallow: /

    # Anthropic: Indexes web pages so Claude can cite them in search-like answers.
    # Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
    User-agent: Claude-SearchBot
    Disallow: /

    # Anthropic: Fetches a page on the spot when a Claude user asks the assistant about a specific URL.
    # Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
    User-agent: Claude-User
    Disallow: /

    # Google: Opt-out token (not a real user-agent) controlling whether Gemini and Vertex AI may train on your content.
    # Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-extended
    User-agent: Google-Extended
    Disallow: /

    # Google: Internal R&D and product-team crawls outside of Search and Ads.
    # Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#googleother
    User-agent: GoogleOther
    Disallow: /

    # Google: Classical Google Search indexer. Powers AI Overviews via the same index.
    # Source: https://developers.google.com/search/docs/crawling-indexing/googlebot
    User-agent: Googlebot
    Disallow: /

    # Perplexity: Indexes web pages so Perplexity can surface them as cited sources in answers.
    # Source: https://docs.perplexity.ai/guides/bots
    User-agent: PerplexityBot
    Disallow: /

    # Perplexity: Fetches a page on the spot when a Perplexity user asks the assistant about a specific URL.
    # Source: https://docs.perplexity.ai/guides/bots
    User-agent: Perplexity-User
    Disallow: /

    # Apple: Powers Siri, Spotlight, and Safari Suggestions search.
    # Source: https://support.apple.com/en-us/119829
    User-agent: Applebot
    Disallow: /

    # Apple: Opt-out token controlling whether Apple Intelligence may train on your content.
    # Source: https://support.apple.com/en-us/119829
    User-agent: Applebot-Extended
    Disallow: /

    # Common Crawl: Bulk crawl of the public web. Downstream datasets feed many AI model training pipelines (including some at OpenAI, Anthropic, and academic groups).
    # Source: https://commoncrawl.org/ccbot
    User-agent: CCBot
    Disallow: /

    # Meta: Crawls public web pages for Meta AI (Llama family) training and indexing.
    # Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
    User-agent: Meta-ExternalAgent
    Disallow: /

    # Meta: Fetches a page on the spot when a Meta AI user asks the assistant about a specific URL.
    # Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
    User-agent: Meta-ExternalFetcher
    Disallow: /

    # ByteDance: Crawls public web pages for ByteDance's foundation-model training (Doubao and related models).
    # Source: https://bytespider.bytedance.com/
    User-agent: Bytespider
    Disallow: /

    # Amazon: Powers Alexa and other Amazon answer/AI experiences.
    # Source: https://developer.amazon.com/amazonbot
    User-agent: Amazonbot
    Disallow: /

    # DuckDuckGo: Indexes web pages so DuckAssist can summarize them in DuckDuckGo answers.
    # Source: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
    User-agent: DuckAssistBot
    Disallow: /

    # Mistral: Fetches a page on the spot when a Mistral Le Chat user asks the assistant about a specific URL.
    # Source: https://docs.mistral.ai/robots/
    User-agent: MistralAI-User
    Disallow: /

    # You.com: Indexes web pages for You.com AI search and chat.
    # Source: https://about.you.com/youbot/
    User-agent: YouBot
    Disallow: /

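One way to confirm what a file like the block list above actually blocks is to parse it with Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: the `blocked_agents` helper, the short `ROBOTS_TXT` excerpt, and the `example.com` URL are assumptions made for the example, and the standard-library parser only approximates how each vendor's crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt in the same shape as the generated block list.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""


def blocked_agents(robots_txt: str, agents: list[str],
                   url: str = "https://example.com/") -> list[str]:
    """Return the agents that this robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # ClaudeBot has no group in the excerpt, so it is not reported as blocked.
    print(blocked_agents(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "Bytespider"]))
```

A crawler with no matching group and no `User-agent: *` group is treated as allowed, which is why removing a `Disallow` group is enough to let that crawler back in.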
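A robots.txt rule only takes effect when you spell the User-agent exactly the way the crawler identifies itself. A single typo ("GPT-Bot" instead of "GPTBot") fails silently. This list takes each name directly from the vendor's public documentation, so your robots.txt does what you intend.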
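Because a misspelled name fails silently, a small lint step can catch near-misses before the file ships. The following is a minimal sketch, not part of the original tool: `find_suspect_agents` and the `0.8` similarity cutoff are assumptions, and names are compared case-insensitively because RFC 9309 specifies case-insensitive matching of the product token.

```python
import difflib

# User-agent names copied from the table in this list.
KNOWN_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "Claude-User", "Google-Extended", "GoogleOther",
    "Googlebot", "PerplexityBot", "Perplexity-User", "Applebot",
    "Applebot-Extended", "CCBot", "Meta-ExternalAgent",
    "Meta-ExternalFetcher", "Bytespider", "Amazonbot", "DuckAssistBot",
    "MistralAI-User", "YouBot",
]


def find_suspect_agents(robots_txt, known=KNOWN_AGENTS, cutoff=0.8):
    """Return (written, suggestion) pairs for User-agent values that
    closely resemble a known name without matching it."""
    by_lower = {name.lower(): name for name in known}
    suspects = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "user-agent":
            continue
        written = value.strip()
        if written == "*" or written.lower() in by_lower:
            continue  # wildcard or a correctly spelled name
        close = difflib.get_close_matches(
            written.lower(), list(by_lower), n=1, cutoff=cutoff)
        if close:
            suspects.append((written, by_lower[close[0]]))
    return suspects


if __name__ == "__main__":
    sample = "User-agent: GPT-Bot\nDisallow: /\n"
    for written, suggestion in find_suspect_agents(sample):
        print(f"{written!r} looks like a typo for {suggestion!r}")
```

Names with no close match are left alone, since a robots.txt may legitimately address crawlers that are not in this list.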
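You usually should not block AI crawlers: most of them are how shoppers find you in ChatGPT, Perplexity, Claude, and Gemini answers. Block only the crawlers whose value to your store is unclear (such as Bytespider), or the training opt-out tokens you have decided to use (Google-Extended, Applebot-Extended).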
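A selective policy like the one described here can be generated rather than hand-edited. The sketch below assumes a hypothetical `build_robots_txt` helper and uses the three names from the paragraph above as its example input; every crawler that is not listed gets no group and is therefore left allowed.

```python
def build_robots_txt(blocked: list[str]) -> str:
    """Emit one 'Disallow: /' group per blocked user-agent.

    Duplicates are dropped while the original order is kept.
    """
    unique = list(dict.fromkeys(blocked))
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in unique]
    return "\n\n".join(groups) + "\n"


# Example policy: one crawler of unclear value plus two training opt-out tokens.
BLOCKED = ["Bytespider", "Google-Extended", "Applebot-Extended"]

if __name__ == "__main__":
    print(build_robots_txt(BLOCKED), end="")
```

Keeping the blocked names in one list makes the policy easy to review, and the output can be appended to whatever other rules your robots.txt already contains.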
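This list is updated whenever a vendor ships a new crawler, deprecates one, or changes its stated robots.txt behavior. Every entry links to the vendor source, so you can verify it directly.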
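An entry does not get a plain "Yes" when the vendor's stated behavior and third-party audits disagree, or when the vendor has not yet published a clear position. When reality is more complicated, we do not invent a clean "Yes".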