AI Crawler User-Agent List

A reference list of the major AI crawlers and user agents: what each one does, who operates it, and whether it respects robots.txt.

Showing 21 crawlers

User-agent	Vendor	Category	Respects robots.txt
GPTBot	OpenAI	AI training	Yes
OAI-SearchBot	OpenAI	AI search index	Yes
ChatGPT-User	OpenAI	User-triggered	Yes
ClaudeBot	Anthropic	AI training	Yes
Claude-SearchBot	Anthropic	AI search index	Yes
Claude-User	Anthropic	User-triggered	Yes
Google-Extended	Google	AI training	Yes
GoogleOther	Google	AI training	Yes
Googlebot	Google	Search engine	Yes
PerplexityBot	Perplexity	AI search index	Yes
Perplexity-User	Perplexity	User-triggered	No
Applebot	Apple	Search engine	Yes
Applebot-Extended	Apple	AI training	Yes
CCBot	Common Crawl	Public dataset	Yes
Meta-ExternalAgent	Meta	AI training	Yes
Meta-ExternalFetcher	Meta	User-triggered	Yes
Bytespider	ByteDance	AI training	Partial
Amazonbot	Amazon	AI search index	Yes
DuckAssistBot	DuckDuckGo	AI search index	Yes
MistralAI-User	Mistral	User-triggered	Yes
YouBot	You.com	AI search index	Yes

robots.txt

# AI crawler block list — generated from clickfrom.ai/tools/ai-crawler-user-agent-list
# Remove the whole group (the User-agent line and its Disallow line) for any crawler you want to allow.

# OpenAI — Crawls public web pages to improve OpenAI foundation models.
# Source: https://platform.openai.com/docs/bots
User-agent: GPTBot
Disallow: /

# OpenAI — Indexes web pages so ChatGPT search and SearchGPT can cite them.
# Source: https://platform.openai.com/docs/bots
User-agent: OAI-SearchBot
Disallow: /

# OpenAI — Fetches a page on the spot when a ChatGPT user asks the assistant about a specific URL.
# Source: https://platform.openai.com/docs/bots
User-agent: ChatGPT-User
Disallow: /

# Anthropic — Crawls public web pages for Anthropic foundation-model training.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: ClaudeBot
Disallow: /

# Anthropic — Indexes web pages so Claude can cite them in search-like answers.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-SearchBot
Disallow: /

# Anthropic — Fetches a page on the spot when a Claude user asks the assistant about a specific URL.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-User
Disallow: /

# Google — Opt-out token (not a real user-agent) controlling whether Gemini and Vertex AI may train on your content.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-extended
User-agent: Google-Extended
Disallow: /

# Google — Internal R&D and product-team crawls outside of Search and Ads.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#googleother
User-agent: GoogleOther
Disallow: /

# Google — Classical Google Search indexer. Powers AI Overviews via the same index.
# Source: https://developers.google.com/search/docs/crawling-indexing/googlebot
User-agent: Googlebot
Disallow: /

# Perplexity — Indexes web pages so Perplexity can surface them as cited sources in answers.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: PerplexityBot
Disallow: /

# Perplexity — Fetches a page on the spot when a Perplexity user asks the assistant about a specific URL.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: Perplexity-User
Disallow: /

# Apple — Powers Siri, Spotlight, and Safari Suggestions search.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot
Disallow: /

# Apple — Opt-out token controlling whether Apple Intelligence may train on your content.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot-Extended
Disallow: /

# Common Crawl — Bulk crawl of the public web. Downstream datasets feed many AI model training pipelines (including some at OpenAI, Anthropic, and academic groups).
# Source: https://commoncrawl.org/ccbot
User-agent: CCBot
Disallow: /

# Meta — Crawls public web pages for Meta AI (Llama family) training and indexing.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalAgent
Disallow: /

# Meta — Fetches a page on the spot when a Meta AI user asks the assistant about a specific URL.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalFetcher
Disallow: /

# ByteDance — Crawls public web pages for ByteDance's foundation-model training (Doubao and related models).
# Source: https://bytespider.bytedance.com/
User-agent: Bytespider
Disallow: /

# Amazon — Powers Alexa and other Amazon answer/AI experiences.
# Source: https://developer.amazon.com/amazonbot
User-agent: Amazonbot
Disallow: /

# DuckDuckGo — Indexes web pages so DuckAssist can summarize them in DuckDuckGo answers.
# Source: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
User-agent: DuckAssistBot
Disallow: /

# Mistral — Fetches a page on the spot when a Mistral Le Chat user asks the assistant about a specific URL.
# Source: https://docs.mistral.ai/robots/
User-agent: MistralAI-User
Disallow: /

# You.com — Indexes web pages for You.com AI search and chat.
# Source: https://about.you.com/youbot/
User-agent: YouBot
Disallow: /
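
A block list like the one above can also be generated from a small catalog instead of edited by hand. The following is a minimal sketch, not the tool's actual implementation: the `CRAWLERS` subset, the English category labels, and the `build_block_list` helper are all illustrative assumptions.

```python
# Hypothetical catalog: a few (user-agent token, category) pairs from the table.
CRAWLERS = [
    ("GPTBot", "AI training"),
    ("OAI-SearchBot", "AI search index"),
    ("ChatGPT-User", "User-triggered"),
    ("ClaudeBot", "AI training"),
    ("Google-Extended", "AI training"),
    ("Googlebot", "Search engine"),
    ("CCBot", "Public dataset"),
]


def build_block_list(crawlers, blocked_categories):
    """Return robots.txt text that disallows every crawler in the given categories."""
    groups = []
    for token, category in crawlers:
        if category in blocked_categories:
            # Emit User-agent and Disallow together so every group is complete.
            groups.append(f"User-agent: {token}\nDisallow: /\n")
    # A blank line between groups keeps the output readable.
    return "\n".join(groups)


print(build_block_list(CRAWLERS, {"AI training"}))
```

Crawlers outside the chosen categories are left out entirely rather than emitted with an empty group, so the output never contains a User-agent line without a rule.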

What this list shows

The exact User-agent string for each major AI crawler, taken from the vendor's documentation
Whether each crawler respects robots.txt, and where exceptions exist
What each crawler is for: AI training, AI search index, user-triggered fetch, classic search, or public dataset

Why a crawler list with source links matters

robots.txt rules only work if you spell the User-agent the way the crawler identifies itself. A typo ("GPT-Bot" instead of "GPTBot") fails silently: the rule matches no crawler, and nothing warns you. This list takes every name directly from the vendor's public documentation, so your robots.txt actually does what you intend.
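
One way to catch such typos before they ship is to compare every User-agent line against the names in this list. The sketch below is an assumption-laden illustration: `KNOWN_TOKENS` is copied from the table, `find_unknown_agents` is a hypothetical helper, and the 0.8 similarity cutoff is an arbitrary choice. Names are compared case-insensitively, as RFC 9309 specifies for product tokens.

```python
import difflib

# Tokens copied from the table on this page.
KNOWN_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-SearchBot",
    "Claude-User", "Google-Extended", "GoogleOther", "Googlebot", "PerplexityBot",
    "Perplexity-User", "Applebot", "Applebot-Extended", "CCBot",
    "Meta-ExternalAgent", "Meta-ExternalFetcher", "Bytespider", "Amazonbot",
    "DuckAssistBot", "MistralAI-User", "YouBot",
]


def find_unknown_agents(robots_txt):
    """Return (name, suggestion) pairs for User-agent values not in KNOWN_TOKENS."""
    known_lower = {token.lower(): token for token in KNOWN_TOKENS}
    findings = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "user-agent":
            continue
        name = value.strip()
        if name == "*" or name.lower() in known_lower:
            continue
        # Suggest the closest known token, if any is similar enough.
        close = difflib.get_close_matches(name.lower(), list(known_lower), n=1, cutoff=0.8)
        findings.append((name, known_lower[close[0]] if close else None))
    return findings


print(find_unknown_agents("User-agent: GPT-Bot\nDisallow: /\n"))
```

A name that is flagged without a suggestion is not necessarily wrong; it may simply belong to a crawler this list does not cover.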

How merchants use this list

Paste the filtered "Copy as robots.txt" block into your Shopify robots.txt.liquid override to block the crawlers you don't want
Google-Extended and Applebot-Extended are robots.txt tokens, not crawlers: they never appear in your access logs
Run /tools/robots-analyzer on your current robots.txt to confirm that the right crawlers are allowed or blocked
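
A similar allow/block check can be approximated locally with Python's standard `urllib.robotparser`. This is only a sketch: the `check_access` helper, the sample rules, and the example.com URL are assumptions, and Python's parser does not necessarily match how each vendor's crawler interprets the file.

```python
from urllib import robotparser


def check_access(robots_txt, agents, url="https://example.com/products/widget"):
    """Map each user-agent token to True (allowed) or False (blocked) for one URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Sample rules: block two training crawlers, leave everything else open.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

for agent, allowed in check_access(SAMPLE, ["GPTBot", "Bytespider", "OAI-SearchBot"]).items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Crawlers with no matching group and no `*` group are reported as allowed, which mirrors the robots.txt default.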

Common mistakes

Blocking Googlebot to opt out of AI Overviews: AI Overviews has no separate UA, so blocking Googlebot also removes you from regular Google Search
Assuming user-triggered fetchers respect robots.txt: Perplexity-User explicitly does not
Copying a UA string from a blog post without checking the vendor's source: names change and blog posts go stale
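
The first mistake is easy to detect mechanically. The sketch below uses a hypothetical `blocks_entire_site` helper and deliberately simplified parsing: it only looks for a bare `Disallow: /` in a group naming the token and ignores `Allow` overrides and wildcard groups. Consecutive User-agent lines are treated as one group even when blank lines separate them, which is how RFC 9309 defines groups.

```python
def blocks_entire_site(robots_txt, token):
    """Return True if a group naming `token` contains a bare `Disallow: /`."""
    in_group = False        # the current group names the token
    reading_agents = False  # still inside a run of User-agent lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank lines do not end a group
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if not reading_agents:
                in_group = False  # a new group starts here
                reading_agents = True
            if value.lower() == token.lower():
                in_group = True
        else:
            reading_agents = False
            if in_group and key == "disallow" and value == "/":
                return True
    return False


rules = "User-agent: Googlebot\nDisallow: /\n"
if blocks_entire_site(rules, "Googlebot"):
    print("Warning: Googlebot is blocked; this also removes the site from Google Search.")
```

Because grouping follows the run of User-agent lines, the same check also flags a token whose own Disallow line was deleted but whose User-agent line now sits directly above another crawler's group.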

AI crawler list FAQ

Should I block AI crawlers on my Shopify store?

Usually not. Most AI crawlers are how shoppers find you in answers from ChatGPT, Perplexity, Claude, and Gemini. Block only the crawlers whose value to your store is unclear (for example, Bytespider), and use the opt-out tokens (Google-Extended, Applebot-Extended) if you have decided to keep your content out of model training.

How often is this list updated?

Whenever a vendor publishes a new crawler, retires an existing one, or changes its stated robots.txt behavior. Each entry links to the vendor's source so you can verify it directly.

Why are some entries marked "partial" or "unclear"?

Because the vendor's stated behavior and third-party audits disagree, or the vendor has not published a clear position. We don't invent a clean "yes" when reality is more complicated.
