A complete reference to AI crawlers and their user-agent strings: what each one does, who runs it, and whether it respects robots.txt.

What this list shows
- The exact User-agent string for every major AI crawler, taken from vendor documentation
- Whether each crawler respects robots.txt, and where the exceptions are
- What each crawler is for: AI training, AI search indexing, user-triggered fetching, traditional search, or shared datasets
| User-agent | Vendor | Category | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | AI training | Yes |
| OAI-SearchBot | OpenAI | AI search indexing | Yes |
| ChatGPT-User | OpenAI | User-triggered fetching | Yes |
| ClaudeBot | Anthropic | AI training | Yes |
| Claude-SearchBot | Anthropic | AI search indexing | Yes |
| Claude-User | Anthropic | User-triggered fetching | Yes |
| Google-Extended | Google | AI training | Yes |
| GoogleOther | Google | AI training | Yes |
| Googlebot | Google | Search engine | Yes |
| PerplexityBot | Perplexity | AI search indexing | Yes |
| Perplexity-User | Perplexity | User-triggered fetching | No |
| Applebot | Apple | Search engine | Yes |
| Applebot-Extended | Apple | AI training | Yes |
| CCBot | Common Crawl | Shared dataset | Yes |
| Meta-ExternalAgent | Meta | AI training | Yes |
| Meta-ExternalFetcher | Meta | User-triggered fetching | Yes |
| Bytespider | ByteDance | AI training | Partial |
| Amazonbot | Amazon | AI search indexing | Yes |
| DuckAssistBot | DuckDuckGo | AI search indexing | Yes |
| MistralAI-User | Mistral | User-triggered fetching | Yes |
| YouBot | You.com | AI search indexing | Yes |
```
# AI crawler block list — generated from clickfrom.ai/tools/ai-crawler-user-agent-list
# Remove the Disallow line for any crawler you want to allow.

# OpenAI — Crawls public web pages to improve OpenAI foundation models.
# Source: https://platform.openai.com/docs/bots
User-agent: GPTBot
Disallow: /

# OpenAI — Indexes web pages so ChatGPT search and SearchGPT can cite them.
# Source: https://platform.openai.com/docs/bots
User-agent: OAI-SearchBot
Disallow: /

# OpenAI — Fetches a page on the spot when a ChatGPT user asks the assistant about a specific URL.
# Source: https://platform.openai.com/docs/bots
User-agent: ChatGPT-User
Disallow: /

# Anthropic — Crawls public web pages for Anthropic foundation-model training.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: ClaudeBot
Disallow: /

# Anthropic — Indexes web pages so Claude can cite them in search-like answers.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-SearchBot
Disallow: /

# Anthropic — Fetches a page on the spot when a Claude user asks the assistant about a specific URL.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-User
Disallow: /

# Google — Opt-out token (not a real user-agent) controlling whether Gemini and Vertex AI may train on your content.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-extended
User-agent: Google-Extended
Disallow: /

# Google — Internal R&D and product-team crawls outside of Search and Ads.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#googleother
User-agent: GoogleOther
Disallow: /

# Google — Classical Google Search indexer. Powers AI Overviews via the same index.
# Source: https://developers.google.com/search/docs/crawling-indexing/googlebot
User-agent: Googlebot
Disallow: /

# Perplexity — Indexes web pages so Perplexity can surface them as cited sources in answers.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: PerplexityBot
Disallow: /

# Perplexity — Fetches a page on the spot when a Perplexity user asks the assistant about a specific URL.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: Perplexity-User
Disallow: /

# Apple — Powers Siri, Spotlight, and Safari Suggestions search.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot
Disallow: /

# Apple — Opt-out token controlling whether Apple Intelligence may train on your content.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot-Extended
Disallow: /

# Common Crawl — Bulk crawl of the public web. Downstream datasets feed many AI model training pipelines (including some at OpenAI, Anthropic, and academic groups).
# Source: https://commoncrawl.org/ccbot
User-agent: CCBot
Disallow: /

# Meta — Crawls public web pages for Meta AI (Llama family) training and indexing.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalAgent
Disallow: /

# Meta — Fetches a page on the spot when a Meta AI user asks the assistant about a specific URL.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalFetcher
Disallow: /

# ByteDance — Crawls public web pages for ByteDance's foundation-model training (Doubao and related models).
# Source: https://bytespider.bytedance.com/
User-agent: Bytespider
Disallow: /

# Amazon — Powers Alexa and other Amazon answer/AI experiences.
# Source: https://developer.amazon.com/amazonbot
User-agent: Amazonbot
Disallow: /

# DuckDuckGo — Indexes web pages so DuckAssist can summarize them in DuckDuckGo answers.
# Source: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
User-agent: DuckAssistBot
Disallow: /

# Mistral — Fetches a page on the spot when a Mistral Le Chat user asks the assistant about a specific URL.
# Source: https://docs.mistral.ai/robots/
User-agent: MistralAI-User
Disallow: /

# You.com — Indexes web pages for You.com AI search and chat.
# Source: https://about.you.com/youbot/
User-agent: YouBot
Disallow: /
```
A robots.txt rule only takes effect when the User-agent is spelled exactly as the crawler reports it. A single typo (writing "GPT-Bot" instead of "GPTBot") fails silently. This list pulls every name directly from the vendor's public documentation, so your robots.txt actually does what you intend.
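One way to catch such typos before they fail silently is to test your rules with Python's standard `urllib.robotparser`. A minimal sketch (the one-rule robots.txt and the example URL are illustrative, not part of the list above):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt that blocks GPTBot site-wide.
rules = """
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The exact vendor-documented name matches the rule and is blocked.
print(parser.can_fetch("GPTBot", "https://example.com/page"))   # False

# A misspelled name matches nothing and silently falls through to "allowed".
print(parser.can_fetch("GPT-Bot", "https://example.com/page"))  # True
```

If `can_fetch` returns True for a name you intended to block, compare the spelling against the table above.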
Usually not: most AI crawlers are precisely how shoppers find your store in ChatGPT, Perplexity, Claude, and Gemini answers. Block only the crawlers whose value to your store is unclear (such as Bytespider), or use the opt-out tokens (Google-Extended, Applebot-Extended) for training you have already decided to decline.
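That selective policy can be sketched as a short robots.txt fragment; everything not listed stays allowed by default, so search and user-triggered crawlers are untouched (adapt the choices to your own decisions):

```
# Block the crawler whose value to the store is unclear.
User-agent: Bytespider
Disallow: /

# Opt out of model training without affecting Googlebot or Applebot.
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```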
Whenever a vendor publishes a new crawler, deprecates an existing one, or changes its documented robots.txt behavior. Every entry links to the vendor's own source, so you can verify it directly.
Because the vendor's stated behavior and third-party audits disagree, or because the vendor has not yet published a clear position. When reality is messier than a clean "Yes", we don't invent one.