A complete reference to AI crawlers and their user-agent strings: what each one does, who runs it, and whether it respects robots.txt.

What this list shows
- The exact User-agent string for every major AI crawler, taken from vendor documentation
- Whether each crawler respects robots.txt, and where the exceptions are
- What each crawler is for: AI training, AI search indexing, user-triggered fetching, traditional search, or shared datasets
| User-agent | Vendor | Category | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | AI training | Yes |
| OAI-SearchBot | OpenAI | AI search indexing | Yes |
| ChatGPT-User | OpenAI | User-triggered fetching | Yes |
| ClaudeBot | Anthropic | AI training | Yes |
| Claude-SearchBot | Anthropic | AI search indexing | Yes |
| Claude-User | Anthropic | User-triggered fetching | Yes |
| Google-Extended | Google | AI training | Yes |
| GoogleOther | Google | AI training | Yes |
| Googlebot | Google | Search engine | Yes |
| PerplexityBot | Perplexity | AI search indexing | Yes |
| Perplexity-User | Perplexity | User-triggered fetching | No |
| Applebot | Apple | Search engine | Yes |
| Applebot-Extended | Apple | AI training | Yes |
| CCBot | Common Crawl | Shared dataset | Yes |
| Meta-ExternalAgent | Meta | AI training | Yes |
| Meta-ExternalFetcher | Meta | User-triggered fetching | Yes |
| Bytespider | ByteDance | AI training | Partial |
| Amazonbot | Amazon | AI search indexing | Yes |
| DuckAssistBot | DuckDuckGo | AI search indexing | Yes |
| MistralAI-User | Mistral | User-triggered fetching | Yes |
| YouBot | You.com | AI search indexing | Yes |
```
# AI crawler block list — generated from clickfrom.ai/tools/ai-crawler-user-agent-list
# Remove the Disallow line for any crawler you want to allow.

# OpenAI — Crawls public web pages to improve OpenAI foundation models.
# Source: https://platform.openai.com/docs/bots
User-agent: GPTBot
Disallow: /

# OpenAI — Indexes web pages so ChatGPT search and SearchGPT can cite them.
# Source: https://platform.openai.com/docs/bots
User-agent: OAI-SearchBot
Disallow: /

# OpenAI — Fetches a page on the spot when a ChatGPT user asks the assistant about a specific URL.
# Source: https://platform.openai.com/docs/bots
User-agent: ChatGPT-User
Disallow: /

# Anthropic — Crawls public web pages for Anthropic foundation-model training.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: ClaudeBot
Disallow: /

# Anthropic — Indexes web pages so Claude can cite them in search-like answers.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-SearchBot
Disallow: /

# Anthropic — Fetches a page on the spot when a Claude user asks the assistant about a specific URL.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-User
Disallow: /

# Google — Opt-out token (not a real user-agent) controlling whether Gemini and Vertex AI may train on your content.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-extended
User-agent: Google-Extended
Disallow: /

# Google — Internal R&D and product-team crawls outside of Search and Ads.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#googleother
User-agent: GoogleOther
Disallow: /

# Google — Classical Google Search indexer. Powers AI Overviews via the same index.
# Source: https://developers.google.com/search/docs/crawling-indexing/googlebot
User-agent: Googlebot
Disallow: /

# Perplexity — Indexes web pages so Perplexity can surface them as cited sources in answers.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: PerplexityBot
Disallow: /

# Perplexity — Fetches a page on the spot when a Perplexity user asks the assistant about a specific URL.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: Perplexity-User
Disallow: /

# Apple — Powers Siri, Spotlight, and Safari Suggestions search.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot
Disallow: /

# Apple — Opt-out token controlling whether Apple Intelligence may train on your content.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot-Extended
Disallow: /

# Common Crawl — Bulk crawl of the public web. Downstream datasets feed many AI model training pipelines (including some at OpenAI, Anthropic, and academic groups).
# Source: https://commoncrawl.org/ccbot
User-agent: CCBot
Disallow: /

# Meta — Crawls public web pages for Meta AI (Llama family) training and indexing.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalAgent
Disallow: /

# Meta — Fetches a page on the spot when a Meta AI user asks the assistant about a specific URL.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalFetcher
Disallow: /

# ByteDance — Crawls public web pages for ByteDance's foundation-model training (Doubao and related models).
# Source: https://bytespider.bytedance.com/
User-agent: Bytespider
Disallow: /

# Amazon — Powers Alexa and other Amazon answer/AI experiences.
# Source: https://developer.amazon.com/amazonbot
User-agent: Amazonbot
Disallow: /

# DuckDuckGo — Indexes web pages so DuckAssist can summarize them in DuckDuckGo answers.
# Source: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
User-agent: DuckAssistBot
Disallow: /

# Mistral — Fetches a page on the spot when a Mistral Le Chat user asks the assistant about a specific URL.
# Source: https://docs.mistral.ai/robots/
User-agent: MistralAI-User
Disallow: /

# You.com — Indexes web pages for You.com AI search and chat.
# Source: https://about.you.com/youbot/
User-agent: YouBot
Disallow: /
```
A robots.txt rule only takes effect when the User-agent is spelled exactly as the crawler reports it. A single typo (writing "GPT-Bot" instead of "GPTBot") fails silently. This list pulls every name directly from the vendor's public documentation, so your robots.txt actually does what you intend.
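One way to catch such typos before they fail silently is to test your rules with Python's standard `urllib.robotparser`. A minimal sketch (the one-rule robots.txt and the example URL are illustrative, not part of the list above):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt that blocks GPTBot site-wide.
rules = """
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The exact vendor-documented name matches the rule and is blocked.
print(parser.can_fetch("GPTBot", "https://example.com/page"))   # False

# A misspelled name matches nothing and silently falls through to "allowed".
print(parser.can_fetch("GPT-Bot", "https://example.com/page"))  # True
```

If `can_fetch` returns True for a name you intended to block, compare the spelling against the table above.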
Usually not: most AI crawlers are precisely how shoppers find your store in ChatGPT, Perplexity, Claude, and Gemini answers. Block only the crawlers whose value to your store is unclear (such as Bytespider), or use the opt-out tokens (Google-Extended, Applebot-Extended) for training you have already decided to decline.
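That selective policy can be sketched as a short robots.txt fragment; everything not listed stays allowed by default, so search and user-triggered crawlers are untouched (adapt the choices to your own decisions):

```
# Block the crawler whose value to the store is unclear.
User-agent: Bytespider
Disallow: /

# Opt out of model training without affecting Googlebot or Applebot.
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```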
Whenever a vendor publishes a new crawler, deprecates an existing one, or changes its documented robots.txt behavior. Every entry links to the vendor's own source, so you can verify it directly.
Because the vendor's stated behavior and third-party audits disagree, or because the vendor has not yet published a clear position. When reality is messier than a clean "Yes", we don't invent one.