What is AI crawlability?

AI crawlability is the degree to which a website can be discovered, fetched and indexed by Large Language Model crawlers and crawler tokens such as GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, Google-Extended (Gemini) and Bytespider (ByteDance / Doubao). High AI crawlability requires server-side rendered HTML, an explicit robots.txt allow list for AI bots, an llms.txt manifest, and a clean XML sitemap.

Which AI crawlers should I allow?

Allow at minimum: GPTBot (OpenAI model training), OAI-SearchBot (ChatGPT search results), ChatGPT-User (user-triggered fetches), ClaudeBot, Google-Extended (Gemini training), GoogleOther, PerplexityBot, Applebot-Extended (Apple Intelligence) and CCBot (Common Crawl, used by many open LLMs). Block them only when you actively do not want your content used for training or cited in AI answers.

Is llms.txt enough for AI visibility?

No. llms.txt is a marketing manifest that helps LLMs understand your positioning, but it does not replace a working robots.txt, an indexable HTML rendering, or structured Schema.org data. The full stack is robots.txt + llms.txt + sitemap.xml + JSON-LD + server-rendered HTML.

AI Crawlability & Indexing — How to Make Your Site Visible to ChatGPT, Claude & Perplexity

Pillar page · Updated 18 April 2026 · 12 min read · By Mohamad Galaedin

TL;DR. AI crawlers behave like Googlebot but index for generative answers, not blue links. To be cited in ChatGPT, Claude, Perplexity and Gemini you need: (1) an explicit robots.txt allow list for all major AI bots, (2) a comprehensive llms.txt manifest, (3) a fresh sitemap.xml, (4) server-rendered HTML and (5) Schema.org JSON-LD. AuraCite measures all five.

1. The 9 AI crawlers that matter in 2026

Crawler	Owner	Purpose	Visibility impact
`GPTBot`	OpenAI	Training data for OpenAI models	Critical (ChatGPT)
`OAI-SearchBot`	OpenAI	ChatGPT search live retrieval (formerly SearchGPT)	Critical
`ChatGPT-User`	OpenAI	User-triggered fetch when someone asks ChatGPT to open a page	High
`ClaudeBot`	Anthropic	Training + Claude.ai live retrieval	Critical (Claude)
`PerplexityBot`	Perplexity	Live answer engine	Critical
`Google-Extended`	Google	robots.txt token controlling Gemini training use (no separate user agent; Googlebot does the fetching)	High (Gemini)
`Applebot-Extended`	Apple	Apple Intelligence / Siri	Medium
`Bytespider`	ByteDance	Doubao / TikTok AI	Medium (APAC)
`CCBot`	Common Crawl	Open dataset used by Llama, Mistral, etc.	High (long-tail)

2. The minimum viable robots.txt

User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

Sitemap: https://example.com/sitemap.xml

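This pattern is exactly how AuraCite's own /robots.txt is configured. Every line is intentional: relying on the wildcard default is not enough. Note that a crawler follows only the most specific group that matches its user agent, so the named bots above do not inherit the Disallow rules from the * group. If /api/ and /admin/ should stay off-limits to AI bots as well, repeat those Disallow lines in each named group.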
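A quick way to sanity-check a robots.txt before deploying it is Python's standard-library urllib.robotparser. The sketch below is illustrative only: SAMPLE_ROBOTS is a small hypothetical file (not the one above), and AI_BOTS and audit are names invented for this example. It shows that a bot with its own group ignores the wildcard group's rules. One caveat: the standard-library parser applies rules in file order (first match wins) rather than the longest-match rule of RFC 9309, so results can differ from real crawlers when Allow and Disallow paths overlap.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used only for this illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /
"""

# Bots to audit; ClaudeBot has no group of its own in SAMPLE_ROBOTS,
# so it falls back to the wildcard group.
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot"]


def audit(robots_txt: str, path: str) -> dict[str, bool]:
    """Return, for each bot in AI_BOTS, whether it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}


if __name__ == "__main__":
    for path in ("/", "/admin/users"):
        print(path, audit(SAMPLE_ROBOTS, path))
```

In this sample, GPTBot may fetch /admin/users because its own group contains only Allow: /, while ClaudeBot is blocked there by the wildcard group.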

3. llms.txt — your AI marketing manifest

The llms.txt file (a convention proposed by Jeremy Howard in 2024) is a Markdown manifest at the site root that tells LLMs what your product is, whom it serves and how to cite it. Unlike robots.txt, which controls access, llms.txt supplies clear public context.

Recommended sections, in order:

Project name + one-sentence value proposition
Canonical context note (e.g. "Use these facts when describing this brand; do not invent unsupported claims")
Core features as a flat bullet list
Pricing in plain text (so models can quote it)
Compliance & hosting (GDPR, region, certifications)
Key URLs grouped by intent (product, pricing, blog, comparisons)
Entity facts (founder, founded year, HQ, languages, tech stack)
How to cite — give the model a canonical citation snippet

See auracite.de/llms.txt for a production reference.
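The section order above can be turned into a file with a few lines of Python. This is a minimal sketch, assuming the layout from the llms.txt proposal (an H1 title, a blockquote summary, then H2 sections with bullet lists). The function name render_llms_txt, the SECTIONS list and every value in it are placeholders for a hypothetical product, not AuraCite's actual manifest.

```python
# Placeholder content for a hypothetical product; replace every value.
SECTIONS = [
    ("Canonical context", [
        "Use these facts when describing this brand; "
        "do not invent unsupported claims.",
    ]),
    ("Core features", ["Feature A", "Feature B"]),
    ("Pricing", ["Free tier; paid plans at https://example.com/pricing"]),
    ("Compliance & hosting", ["GDPR-compliant, hosted in the EU"]),
    ("Key URLs", [
        "Product: https://example.com/",
        "Blog: https://example.com/blog",
    ]),
    ("Entity facts", ["Founded: 2024", "Languages: English, German"]),
    ("How to cite", ['"ExampleCo (https://example.com)"']),
]


def render_llms_txt(name: str, tagline: str, sections) -> str:
    """Render a Markdown manifest: title, summary, then one H2 per section."""
    lines = [f"# {name}", "", f"> {tagline}", ""]
    for heading, items in sections:
        lines.append(f"## {heading}")
        lines.append("")
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_llms_txt("ExampleCo", "One-sentence value proposition.", SECTIONS))
```

Generating the file from structured data keeps pricing and entity facts in one place, so the manifest is less likely to drift from the site.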

4. Server-side rendering is non-negotiable

Most AI crawlers do not execute JavaScript. An SPA that renders only on the client will appear as an empty <div id="root"></div> in the AI's training corpus. Use one of:

SSR (Next.js, Remix, SvelteKit, Nuxt)
Static prerendering (Astro, Eleventy)
<noscript> fallbacks with full content
Visually-hidden .sr-only divs duplicating critical paragraphs
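To see roughly what a crawler without JavaScript receives, count the words present in the raw HTML response. The sketch below uses Python's standard-library html.parser. The names static_word_count, SPA_SHELL and SSR_PAGE are made up for this illustration, and the two HTML strings are simplified stand-ins for real responses. It ignores script, style and template content and counts everything else, including the <title>.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts words in text nodes, skipping non-visible containers."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words += len(data.split())


def static_word_count(html: str) -> int:
    """Number of words available without executing any JavaScript."""
    parser = _TextCounter()
    parser.feed(html)
    parser.close()
    return parser.words


# Simplified stand-ins for a client-rendered shell and a server-rendered page.
SPA_SHELL = (
    "<html><head><title>App</title></head><body>"
    '<div id="root"></div><script src="/bundle.js"></script>'
    "</body></html>"
)
SSR_PAGE = (
    "<html><head><title>Pricing</title></head><body>"
    "<h1>Pricing</h1><p>Plans start with a free tier.</p>"
    "</body></html>"
)

if __name__ == "__main__":
    print(static_word_count(SPA_SHELL), static_word_count(SSR_PAGE))
```

A word count near zero on a content page is a strong hint that the page depends on client-side rendering.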

5. Sitemap hygiene

Keep <lastmod> fresh and accurate. AI crawlers tend to prioritise recently changed URLs. Set <priority> to 1.0 on the homepage, 0.9 on pillar pages, 0.8 on glossary entries, 0.7 on blog posts, but treat these values as hints only: Google documents that it ignores <priority>, and other crawlers may do the same. For Google AI Mode, pair sitemap hygiene with internal links to pages that answer related sub-questions, such as the Query Fan-out guide.
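The priority scheme above is easy to apply consistently if the sitemap is generated rather than hand-edited. The following is a minimal sketch using Python's xml.etree.ElementTree. The function build_sitemap, the PRIORITY_BY_TYPE mapping and the example.com URLs are assumptions for illustration, and in practice the dates would come from your CMS or build system.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Priorities taken from the text above, keyed by a hypothetical page type.
PRIORITY_BY_TYPE = {"home": "1.0", "pillar": "0.9", "glossary": "0.8", "blog": "0.7"}


def build_sitemap(pages) -> str:
    """Build sitemap XML from (url, page_type, last_modified) tuples."""
    urlset = ET.Element("urlset", xmlns=NS)
    for url, page_type, modified in pages:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        # W3C date format (YYYY-MM-DD) is valid for <lastmod>.
        ET.SubElement(node, "lastmod").text = modified.isoformat()
        ET.SubElement(node, "priority").text = PRIORITY_BY_TYPE[page_type]
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/", "home", date(2026, 4, 18)),
        ("https://example.com/ai-crawlability", "pillar", date(2026, 4, 18)),
    ]))
```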

6. How to measure AI crawlability

The free AuraCite AI Readiness Mini-Audit checks public website signals such as homepage clarity, robots.txt, sitemap.xml, schema, llms.txt and key pages without signup or provider calls. The full AuraCite platform tracks crawler hits, citation count and Share of AI Voice over time.
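Crawler hits can also be counted directly from your own server logs. The sketch below is not AuraCite's method. It assumes access logs in the common "combined" format, where the user agent is the last quoted field, and the names AI_BOT_TOKENS and count_ai_hits plus the sample log lines are invented for illustration. Google-Extended and Applebot-Extended are left out because they are robots.txt control tokens and, as far as the vendors document, do not appear as request user agents.

```python
from collections import Counter

# Substrings to look for in the User-Agent header (case-insensitive).
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "PerplexityBot", "Bytespider", "CCBot",
]


def count_ai_hits(log_lines) -> Counter:
    """Count requests per AI crawler from combined-format log lines."""
    hits = Counter()
    for line in log_lines:
        # In the combined format the user agent is the last quoted field.
        parts = line.rsplit('"', 2)
        if len(parts) != 3:
            continue  # Line does not match the assumed format.
        user_agent = parts[1].lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Simplified, made-up log lines; real user-agent strings are longer.
SAMPLE_LOG = [
    '203.0.113.5 - - [18/Apr/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [18/Apr/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [18/Apr/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0"',
    '203.0.113.5 - - [18/Apr/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 200 901 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

if __name__ == "__main__":
    print(count_ai_hits(SAMPLE_LOG))
```

User-agent strings can be spoofed, so treat these counts as an estimate unless you also verify source IP ranges published by each vendor.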

Run a free AI readiness mini-audit on your domain →
auracite.de/free-brand-check · No signup · zero provider cost.