AI Crawler User-Agent List

Reference list of every major AI crawler and user-agent — what they do, who runs them, and whether they respect robots.txt.

User-agent	Vendor	Category	Respects robots.txt
GPTBot	OpenAI	AI training	Yes
OAI-SearchBot	OpenAI	AI search index	Yes
ChatGPT-User	OpenAI	User-triggered fetch	Yes
ClaudeBot	Anthropic	AI training	Yes
Claude-SearchBot	Anthropic	AI search index	Yes
Claude-User	Anthropic	User-triggered fetch	Yes
Google-Extended	Google	AI training	Yes
GoogleOther	Google	AI training	Yes
Googlebot	Google	Search engine	Yes
PerplexityBot	Perplexity	AI search index	Yes
Perplexity-User	Perplexity	User-triggered fetch	No
Applebot	Apple	Search engine	Yes
Applebot-Extended	Apple	AI training	Yes
CCBot	Common Crawl	Shared dataset	Yes
Meta-ExternalAgent	Meta	AI training	Yes
Meta-ExternalFetcher	Meta	User-triggered fetch	Yes
Bytespider	ByteDance	AI training	Partial
Amazonbot	Amazon	AI search index	Yes
DuckAssistBot	DuckDuckGo	AI search index	Yes
MistralAI-User	Mistral	User-triggered fetch	Yes
YouBot	You.com	AI search index	Yes

robots.txt

# AI crawler block list — generated from clickfrom.ai/tools/ai-crawler-user-agent-list
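# To allow a crawler, delete its whole group (the User-agent line and its Disallow line).
# Deleting only the Disallow line can leave the User-agent line attached to the next group's rule.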

# OpenAI — Crawls public web pages to improve OpenAI foundation models.
# Source: https://platform.openai.com/docs/bots
User-agent: GPTBot
Disallow: /

# OpenAI — Indexes web pages so ChatGPT search and SearchGPT can cite them.
# Source: https://platform.openai.com/docs/bots
User-agent: OAI-SearchBot
Disallow: /

# OpenAI — Fetches a page on the spot when a ChatGPT user asks the assistant about a specific URL.
# Source: https://platform.openai.com/docs/bots
User-agent: ChatGPT-User
Disallow: /

# Anthropic — Crawls public web pages for Anthropic foundation-model training.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: ClaudeBot
Disallow: /

# Anthropic — Indexes web pages so Claude can cite them in search-like answers.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-SearchBot
Disallow: /

# Anthropic — Fetches a page on the spot when a Claude user asks the assistant about a specific URL.
# Source: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
User-agent: Claude-User
Disallow: /

# Google — Opt-out token (not a real user-agent) controlling whether Gemini and Vertex AI may train on your content.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-extended
User-agent: Google-Extended
Disallow: /

# Google — Internal R&D and product-team crawls outside of Search and Ads.
# Source: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#googleother
User-agent: GoogleOther
Disallow: /

# Google — Classical Google Search indexer. Powers AI Overviews via the same index.
# Source: https://developers.google.com/search/docs/crawling-indexing/googlebot
# Caution: blocking Googlebot also removes the site from regular Google Search. Delete this group unless that is intended.
User-agent: Googlebot
Disallow: /

# Perplexity — Indexes web pages so Perplexity can surface them as cited sources in answers.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: PerplexityBot
Disallow: /

# Perplexity — Fetches a page on the spot when a Perplexity user asks the assistant about a specific URL.
# Source: https://docs.perplexity.ai/guides/bots
User-agent: Perplexity-User
Disallow: /

# Apple — Powers Siri, Spotlight, and Safari Suggestions search.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot
Disallow: /

# Apple — Opt-out token controlling whether Apple Intelligence may train on your content.
# Source: https://support.apple.com/en-us/119829
User-agent: Applebot-Extended
Disallow: /

# Common Crawl — Bulk crawl of the public web. Downstream datasets feed many AI model training pipelines (including some at OpenAI, Anthropic, and academic groups).
# Source: https://commoncrawl.org/ccbot
User-agent: CCBot
Disallow: /

# Meta — Crawls public web pages for Meta AI (Llama family) training and indexing.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalAgent
Disallow: /

# Meta — Fetches a page on the spot when a Meta AI user asks the assistant about a specific URL.
# Source: https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/
User-agent: Meta-ExternalFetcher
Disallow: /

# ByteDance — Crawls public web pages for ByteDance's foundation-model training (Doubao and related models).
# Source: https://bytespider.bytedance.com/
User-agent: Bytespider
Disallow: /

# Amazon — Powers Alexa and other Amazon answer/AI experiences.
# Source: https://developer.amazon.com/amazonbot
User-agent: Amazonbot
Disallow: /

# DuckDuckGo — Indexes web pages so DuckAssist can summarize them in DuckDuckGo answers.
# Source: https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot/
User-agent: DuckAssistBot
Disallow: /

# Mistral — Fetches a page on the spot when a Mistral Le Chat user asks the assistant about a specific URL.
# Source: https://docs.mistral.ai/robots/
User-agent: MistralAI-User
Disallow: /

# You.com — Indexes web pages for You.com AI search and chat.
# Source: https://about.you.com/youbot/
User-agent: YouBot
Disallow: /

What this list shows

Every major AI crawler's exact User-agent string, sourced from vendor documentation
Whether each crawler respects robots.txt — and where exceptions exist
What each crawler is for: AI training, AI search index, user-triggered fetch, classical search, or shared dataset
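
One way to put the names and categories above to work is to label crawler traffic in your own access logs. The sketch below is a minimal Python illustration, not a vendor tool: `CRAWLERS` is copied from the table on this page, `classify_user_agent` is a hypothetical helper name, and it assumes each crawler's token appears somewhere in the User-Agent header, matched case-insensitively.

```python
# Names, vendors, and categories copied from the table on this page.
# Google-Extended and Applebot-Extended are left out on purpose: they are
# robots.txt tokens and are not sent as a User-Agent header.
CRAWLERS = {
    "GPTBot": ("OpenAI", "AI training"),
    "OAI-SearchBot": ("OpenAI", "AI search index"),
    "ChatGPT-User": ("OpenAI", "User-triggered fetch"),
    "ClaudeBot": ("Anthropic", "AI training"),
    "Claude-SearchBot": ("Anthropic", "AI search index"),
    "Claude-User": ("Anthropic", "User-triggered fetch"),
    "GoogleOther": ("Google", "AI training"),
    "Googlebot": ("Google", "Search engine"),
    "PerplexityBot": ("Perplexity", "AI search index"),
    "Perplexity-User": ("Perplexity", "User-triggered fetch"),
    "Applebot": ("Apple", "Search engine"),
    "CCBot": ("Common Crawl", "Shared dataset"),
    "Meta-ExternalAgent": ("Meta", "AI training"),
    "Meta-ExternalFetcher": ("Meta", "User-triggered fetch"),
    "Bytespider": ("ByteDance", "AI training"),
    "Amazonbot": ("Amazon", "AI search index"),
    "DuckAssistBot": ("DuckDuckGo", "AI search index"),
    "MistralAI-User": ("Mistral", "User-triggered fetch"),
    "YouBot": ("You.com", "AI search index"),
}


def classify_user_agent(header):
    """Return (name, vendor, category) for the first known token in header, else None."""
    lowered = header.lower()
    for name, (vendor, category) in CRAWLERS.items():
        # Case-insensitive substring match on the crawler token.
        if name.lower() in lowered:
            return name, vendor, category
    return None
```

Calling `classify_user_agent` on a log line's User-Agent field returns the matching table row, or `None` for ordinary browser traffic. User-Agent headers can be spoofed, so treat the result as a hint rather than proof of who is crawling.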

Why a sourced crawler list matters

Robots.txt rules only work if you spell the User-agent exactly the way the crawler announces itself. A typo ("GPT-Bot" instead of "GPTBot") silently fails. This list pulls each name directly from the vendor's public docs so your robots.txt actually does what you intend.
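
The typo problem described above can be caught mechanically. The following is a rough Python sketch under stated assumptions: `check_user_agents` is a hypothetical helper, the caller supplies the list of known names (for example, the names in the table on this page), and "close" is defined by the standard library's `difflib` similarity with an arbitrary cutoff of 0.8.

```python
import difflib


def check_user_agents(robots_txt, known_agents):
    """Return (line_number, name, suggestion) for each unrecognized User-agent name."""
    known_lower = {name.lower(): name for name in known_agents}
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        # Drop trailing comments, then split "Key: value".
        content = line.split("#", 1)[0].strip()
        key, sep, value = content.partition(":")
        if not sep or key.strip().lower() != "user-agent":
            continue
        name = value.strip()
        if name == "*" or name.lower() in known_lower:
            continue
        # Suggest the closest known name, if any is similar enough.
        close = difflib.get_close_matches(
            name.lower(), list(known_lower), n=1, cutoff=0.8
        )
        suggestion = known_lower[close[0]] if close else None
        problems.append((number, name, suggestion))
    return problems
```

A name reported with no suggestion is not necessarily wrong; it may simply be a crawler that is not in the list you passed in.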

How merchants use this list

Paste the filtered "Copy as robots.txt" block into your Shopify robots.txt.liquid override to block crawlers you don't want
For Google-Extended and Applebot-Extended, remember these are robots.txt tokens — they never appear in your access logs
Run /tools/robots-analyzer against your current robots.txt to verify the right crawlers are allowed or blocked
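
For a quick local check alongside the analyzer, Python's standard `urllib.robotparser` module can report which names a robots.txt file blocks. This is only a sketch: `blocked_agents` is a hypothetical helper, `https://example.com/` is a placeholder URL, and the standard-library parser uses its own matching rules (it matches a rule's name as a substring of the queried name), which may differ from how a given vendor's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Map each agent name to True if robots_txt disallows it from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    sample = (
        "User-agent: GPTBot\n"
        "Disallow: /\n"
        "\n"
        "User-agent: Google-Extended\n"
        "Disallow: /\n"
    )
    for agent, blocked in blocked_agents(
        sample, ["GPTBot", "Google-Extended", "Googlebot"]
    ).items():
        print(agent, "blocked" if blocked else "allowed")
```

Because Google-Extended and Applebot-Extended never show up in access logs, a parser check like this is one of the few ways to confirm those tokens are set the way you intended.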

Common mistakes to avoid

Blocking Googlebot to opt out of AI Overviews: there is no separate UA for AI Overviews, so blocking Googlebot removes you from regular Google Search too
Assuming all user-triggered fetchers respect robots.txt: Perplexity-User explicitly does not
Copying a UA string from a blog post without checking the vendor source: names change and blogs go stale
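
To avoid the first mistake when generating a block list, classical search crawlers can be filtered out before any rules are written. The sketch below assumes a hypothetical `build_block_list` helper and a shortened `SAMPLE_CRAWLERS` list whose names and categories are taken from the table on this page; the choice to keep the "Search engine" category is an assumption you can change.

```python
# (name, category) pairs taken from the table on this page, shortened for the example.
SAMPLE_CRAWLERS = [
    ("GPTBot", "AI training"),
    ("Google-Extended", "AI training"),
    ("Googlebot", "Search engine"),
    ("Applebot", "Search engine"),
    ("Bytespider", "AI training"),
]


def build_block_list(crawlers, keep_categories=("Search engine",)):
    """Return robots.txt text that disallows every crawler not in keep_categories."""
    groups = [
        f"User-agent: {name}\nDisallow: /"
        for name, category in crawlers
        if category not in keep_categories
    ]
    # One blank line between groups, trailing newline at the end of the file.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_block_list(SAMPLE_CRAWLERS))
```

With the sample data, the output contains groups for GPTBot, Google-Extended, and Bytespider, and leaves Googlebot and Applebot unblocked.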

AI crawler list FAQ

Should I block AI crawlers from my Shopify store?

Usually no. Most AI crawlers are how shoppers find you in ChatGPT, Perplexity, Claude, and Gemini answers. Block only the crawlers whose value to your store is unclear (e.g. Bytespider), and use the opt-out tokens (Google-Extended, Applebot-Extended) only if you have decided to keep your content out of model training.

How often does this list update?

Whenever a vendor publishes a new crawler, deprecates one, or changes their stated robots.txt behavior. Every entry links to the vendor source so you can verify directly.

Why are some entries marked "partial" or "unclear"?

Because the vendor's stated behavior and third-party audits don't agree, or the vendor hasn't published a clear position. We don't fabricate a clean "yes" when reality is messier.