
AI Crawler User-Agent List

A complete reference list of AI crawlers and user agents: what each one does, who runs it, and whether it respects robots.txt.

Showing 21 crawlers
User-agent            Vendor
GPTBot                OpenAI
OAI-SearchBot         OpenAI
ChatGPT-User          OpenAI
ClaudeBot             Anthropic
Claude-SearchBot      Anthropic
Claude-User           Anthropic
Google-Extended       Google
GoogleOther           Google
Googlebot             Google
PerplexityBot         Perplexity
Perplexity-User       Perplexity
Applebot              Apple
Applebot-Extended     Apple
CCBot                 Common Crawl
Meta-ExternalAgent    Meta
Meta-ExternalFetcher  Meta
Bytespider            ByteDance
Amazonbot             Amazon
DuckAssistBot         DuckDuckGo
MistralAI-User        Mistral
YouBot                You.com
robots.txt
# AI crawler block list — generated from clickfrom.ai/tools/ai-crawler-user-agent-list
# Remove the Disallow line for any crawler you want to allow.

# OpenAI — Crawls public web pages to improve OpenAI foundation models.
# Source: https://platform.openai.com/docs/bots
User-agent: GPTBot
Disallow: /

# OpenAI — Indexes web pages so ChatGPT search and SearchGPT can cite them.
# Source: https://platform.openai.com/docs/bots
User-agent: OAI-SearchBot
Disallow: /

# OpenAI — Fetches a page on the spot when a ChatGPT user asks the assistant about a specific URL.
# Source: https://platform.openai.com/docs/bots
User-agent: ChatGPT-User
Disallow: /

# Anthropic — Crawls public web pages for Anthropic foundation-model training.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: ClaudeBot
Disallow: /

# Anthropic — Indexes web pages so Claude can cite them in search-like answers.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-SearchBot
Disallow: /

# Anthropic — Fetches a page on the spot when a Claude user asks the assistant about a specific URL.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-User
Disallow: /

# Google — Opt-out token (not a real user-agent) controlling whether Gemini and Vertex AI may train on your content.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-extended
User-agent: Google-Extended
Disallow: /

# Google — Internal R&D and product-team crawls outside of Search and Ads.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#googleother
User-agent: GoogleOther
Disallow: /

# Google — Classical Google Search indexer. Powers AI Overviews via the same index.
# Source: https://developers.google.com/search/docs/crawling-indexing/googlebot
User-agent: Googlebot
Disallow: /

# Perplexity — Indexes web pages so Perplexity can surface them as cited sources in answers.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: PerplexityBot
Disallow: /

# Perplexity — Fetches a page on the spot when a Perplexity user asks the assistant about a specific URL.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: Perplexity-User
Disallow: /

# Apple — Powers Siri, Spotlight, and Safari Suggestions search.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot
Disallow: /

# Apple — Opt-out token controlling whether Apple Intelligence may train on your content.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot-Extended
Disallow: /

# Common Crawl — Bulk crawl of the public web. Downstream datasets feed many AI model training pipelines (including some at OpenAI, Anthropic, and academic groups).
# Source: https://commoncrawl.org/ccbot
User-agent: CCBot
Disallow: /

# Meta — Crawls public web pages for Meta AI (Llama family) training and indexing.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalAgent
Disallow: /

# Meta — Fetches a page on the spot when a Meta AI user asks the assistant about a specific URL.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalFetcher
Disallow: /

# ByteDance — Crawls public web pages for ByteDance's foundation-model training (Doubao and related models).
# Source: https://bytespider.bytedance.com/
User-agent: Bytespider
Disallow: /

# Amazon — Powers Alexa and other Amazon answer/AI experiences.
# Source: https://developer.amazon.com/amazonbot
User-agent: Amazonbot
Disallow: /

# DuckDuckGo — Indexes web pages so DuckAssist can summarize them in DuckDuckGo answers.
# Source: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
User-agent: DuckAssistBot
Disallow: /

# Mistral — Fetches a page on the spot when a Mistral Le Chat user asks the assistant about a specific URL.
# Source: https://docs.mistral.ai/robots/
User-agent: MistralAI-User
Disallow: /

# You.com — Indexes web pages for You.com AI search and chat.
# Source: https://about.you.com/youbot/
User-agent: YouBot
Disallow: /

What this list shows

  • The exact User-agent string for every major AI crawler, sourced from vendor documentation
  • Whether each crawler respects robots.txt, and where the exceptions are
  • What each crawler is for: AI training, AI search indexing, user-triggered fetching, traditional search, or shared datasets

Why a sourced crawler list matters

A robots.txt rule only takes effect if you spell the User-agent exactly as the crawler reports it. A single typo (writing "GPT-Bot" instead of "GPTBot") fails silently. This list pulls every name directly from the vendor's public documentation, so your robots.txt actually does what you intend.
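You can see the silent failure with Python's standard-library robots.txt parser. A minimal check (the example URL is illustrative): the misspelled "GPT-Bot" group never matches the real crawler token, so the block is a no-op, while the correctly spelled rule blocks as intended.

```python
# Why exact User-agent spelling matters: urllib.robotparser matches the
# group token against the crawler name, so a typo'd group never applies.
from urllib.robotparser import RobotFileParser

# Misspelled rule: "GPT-Bot" instead of the real token "GPTBot".
typo = RobotFileParser()
typo.parse(["User-agent: GPT-Bot", "Disallow: /"])

# Correctly spelled rule.
correct = RobotFileParser()
correct.parse(["User-agent: GPTBot", "Disallow: /"])

# The typo'd rule does NOT apply to the real crawler, so the fetch is allowed.
print(typo.can_fetch("GPTBot", "https://example.com/products"))     # True
# The correct rule blocks it as intended.
print(correct.can_fetch("GPTBot", "https://example.com/products"))  # False
```

The same substring-style matching is why one wrong character is enough to disable the rule without any warning.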

How merchants use this list

  • Paste the filtered "Copy as robots.txt" block into your Shopify robots.txt.liquid override to block the crawlers you don't want
  • Remember that Google-Extended and Applebot-Extended are robots.txt tokens only; they will never show up in your access logs
  • Run /tools/robots-analyzer against your current robots.txt to verify that the right crawlers are allowed or blocked
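To check which of these crawlers actually visit your store, you can tally user-agent hits in your access log. A minimal sketch (the sample log lines and the crawler subset are illustrative assumptions; point it at your real log): note that the two -Extended opt-out tokens are deliberately absent, since they never appear in logs.

```python
# Minimal sketch: tally access-log requests per AI crawler user-agent.
# The sample lines below are made up; adapt the parsing to your log format.
# Google-Extended and Applebot-Extended are omitted on purpose: they are
# robots.txt-only tokens and never appear in an access log.
from collections import Counter

AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot", "Bytespider", "Amazonbot",
]

def count_crawler_hits(lines):
    hits = Counter()
    for line in lines:
        for ua in AI_CRAWLERS:
            if ua in line:   # plain substring match on the raw log line
                hits[ua] += 1
                break        # count each request once
    return hits

sample = [
    '1.2.3.4 - - [01/Jan/2025] "GET /products HTTP/1.1" 200 512 '
    '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [01/Jan/2025] "GET /blog HTTP/1.1" 200 2048 '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
print(count_crawler_hits(sample))
```

Seeing zero hits for a crawler you blocked is expected; seeing zero hits for Google-Extended is not evidence of anything, because that token never crawls.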

Common mistakes to avoid

  • Blocking Googlebot to opt out of AI Overviews: AI Overviews has no separate UA, so blocking Googlebot also removes you from regular Google Search
  • Assuming user-triggered fetchers respect robots.txt: Perplexity-User explicitly does not
  • Copying UA strings from blog posts without checking the vendor source: names change, and blog posts go stale

AI crawler list FAQ

Should I block AI crawlers on my Shopify store?

Usually not: most AI crawlers are exactly how shoppers find you in ChatGPT, Perplexity, Claude, and Gemini answers. Block only the crawlers whose value to your store is unclear (such as Bytespider), or the opt-out tokens for training you have already decided to decline (Google-Extended, Applebot-Extended).

How often is this list updated?

Whenever a vendor ships a new crawler, deprecates one, or changes its stated robots.txt behavior. Every entry links to the vendor source so you can verify it yourself.

Why are some entries marked "Partial" or "Unstated"?

Because the vendor's stated behavior disagrees with third-party audits, or because the vendor has not published a clear position. When reality is messier, we don't invent a clean "Yes".