robots.txt allow list for all major AI bots, (2) a comprehensive llms.txt manifest, (3) a fresh sitemap.xml, (4) server-rendered HTML and (5) Schema.org JSON-LD. AuraCite measures all five.
| Crawler | Owner | Purpose | Visibility impact |
|---|---|---|---|
| GPTBot | OpenAI | Training data + live SearchGPT | Critical (ChatGPT) |
| OAI-SearchBot | OpenAI | SearchGPT live retrieval | Critical |
| ChatGPT-User | OpenAI | User-triggered fetch (Browse with Bing) | High |
| ClaudeBot | Anthropic | Training + Claude.ai live retrieval | Critical (Claude) |
| PerplexityBot | Perplexity | Live answer engine | Critical |
| Google-Extended | Google | Gemini training | High (Gemini) |
| Applebot-Extended | Apple | Apple Intelligence / Siri | Medium |
| Bytespider | ByteDance | Doubao / TikTok AI | Medium (APAC) |
| CCBot | Common Crawl | Open dataset used by Llama, Mistral, etc. | High (long-tail) |
```
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
This pattern is exactly how AuraCite's own /robots.txt is configured. Every line is intentional; defaults are not enough.
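One way to sanity-check an allow list like this is to parse it offline and ask, bot by bot, whether the homepage is fetchable. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper names `audit_robots` and `explicit_groups` are hypothetical, and the standard-library parser is only an approximation of how any individual crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the table above.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
    "Google-Extended", "Applebot-Extended", "Bytespider", "CCBot",
]


def audit_robots(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Return {bot: allowed?} for `url`, parsing robots.txt text without any network call."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in AI_BOTS}


def explicit_groups(robots_txt: str) -> set[str]:
    """Collect every name that appears on a User-agent line."""
    names = set()
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "user-agent":
            names.add(value.strip())
    return names


if __name__ == "__main__":
    sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
    report = audit_robots(sample)
    blocked = [bot for bot, ok in report.items() if not ok]
    unnamed = [bot for bot in AI_BOTS if bot not in explicit_groups(sample)]
    print("blocked:", blocked)
    print("no explicit group:", unnamed)
```

Under RFC 9309, a crawler that matches a named group follows only that group, so the `Disallow` rules in the `*` group do not carry over to the named bots unless they are repeated there.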
The llms.txt file (proposed by Jeremy Howard, 2024) is a markdown manifest at the site root that tells LLMs what your product is, who it serves and how to cite it. Unlike robots.txt (which controls access), llms.txt provides clear public context.
Recommended sections, in order:
See auracite.de/llms.txt for a production reference.
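As a rough illustration of what such a manifest can look like, the sketch below assumes the general layout from the llms.txt proposal: an H1 with the site name, a blockquote summary, then H2 sections holding link lists. It does not reproduce AuraCite's own file; the function name `build_llms_txt`, the section titles and the example.com URLs are all hypothetical.

```python
def build_llms_txt(
    name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render a markdown manifest: H1 title, blockquote summary, H2 link sections.

    `sections` maps a heading to (link title, URL, short note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    manifest = build_llms_txt(
        "Example Co",
        "Example Co is a hypothetical product used only for illustration.",
        {"Docs": [("Pricing", "https://example.com/pricing", "Plans and limits")]},
    )
    print(manifest)
```

Generating the file from the same data that drives the site's navigation is one way to keep it from drifting out of date.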
Most AI crawlers do not execute JavaScript. A SPA that renders client-side will appear as an empty <div id="root"></div> in the AI's training corpus. Use one of:
- <noscript> fallbacks with full content
- .sr-only divs duplicating critical paragraphs

Keep <lastmod> fresh. AI crawlers prioritise recently changed URLs. Set <priority> to 1.0 on the homepage, 0.9 on pillar pages, 0.8 on glossary entries, 0.7 on blog posts. For Google AI Mode, pair sitemap hygiene with internal links to pages that answer related sub-questions, such as the Query Fan-out guide.
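The priority scheme above can be sketched as a small sitemap generator. This is a minimal example using only the Python standard library; the page kinds (`home`, `pillar`, `glossary`, `blog`), the function name `build_sitemap` and the example.com URLs are assumptions made for illustration, while the priority values are the ones listed above.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Priority per page kind, as recommended in the text above.
PRIORITY = {"home": "1.0", "pillar": "0.9", "glossary": "0.8", "blog": "0.7"}
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, str, date]]) -> str:
    """Build sitemap XML from (url, kind, last_modified) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, kind, modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # <lastmod> should reflect the real content change date, not the build time.
        ET.SubElement(url, "lastmod").text = modified.isoformat()
        ET.SubElement(url, "priority").text = PRIORITY[kind]
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/", "home", date(2025, 1, 15)),
        ("https://example.com/blog/first-post", "blog", date(2025, 1, 10)),
    ]))
```

Feeding `modified` from the content store's actual edit timestamps, rather than stamping every URL at deploy time, keeps `<lastmod>` meaningful.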
The free AuraCite AI Readiness Mini-Audit checks public website signals such as homepage clarity, robots.txt, sitemap.xml, schema, llms.txt and key pages without signup or provider calls. The full AuraCite platform tracks crawler hits, citation count and Share of AI Voice over time.