There is a configuration mistake sitting in thousands of B2B websites right now. It generates no error message, triggers no Google Search Console alert, and produces zero warnings in your CMS. But it is silently making your content invisible to every AI assistant that your prospects use — ChatGPT, Claude, Perplexity, and more.
It lives in your robots.txt file. And the fix takes about two minutes once you know what to look for.
The Silent AI Visibility Killer Hiding in Your robots.txt
Here is what happened. At some point — during a site migration, a relaunch, a staging environment setup, or a security review — someone added a blanket disallow rule to your robots.txt:
User-agent: *
Disallow: /
This two-line block tells every crawler on the internet to stay off your site entirely. It is a standard configuration for staging environments and sites under construction. The problem is that it frequently survives into production. It gets copy-pasted from a template, committed during a rushed deployment, and never revisited.
When Googlebot hits this rule, it stops crawling — but at least you will notice the consequences: your pages disappear from Google Search within weeks, traffic collapses, and Search Console lights up with coverage errors. The feedback loop is brutal, but it is fast.
AI crawlers are different. GPTBot, ClaudeBot, and PerplexityBot do not report back to any dashboard you own. When they are blocked, they simply move on. You receive no notification. Your pages do not disappear from a ranking you can monitor. The invisible consequence is that your content never makes it into the knowledge base these AI systems draw from — and your competitors who are not blocked start appearing in AI-generated answers instead of you.
This is not a theoretical risk. It is happening right now to sites that have been live for years, maintained by teams that would be horrified to learn their robots.txt is doing this.
The 5 AI Crawlers You Need to Allow
Each major AI platform operates its own crawler. Blocking any one of them means your content is excluded from that platform's answers, citations, and knowledge retrieval. Here are the five you need to explicitly account for:
| AI Engine | Crawler Name | User-Agent String |
|---|---|---|
| ChatGPT | GPTBot | GPTBot |
| Claude / Anthropic | ClaudeBot | ClaudeBot |
| Perplexity | PerplexityBot | PerplexityBot |
| Google Gemini (AI training & grounding) | Google-Extended | Google-Extended |
| Common Crawl | CCBot | CCBot |
A blanket User-agent: * / Disallow: / rule catches all five of these. Even if you have explicit Allow rules for Googlebot and Bingbot, those carve-outs do not automatically extend to AI crawlers. Each user-agent string requires its own explicit rule if you have a wildcard block in place.
Common Crawl (CCBot) deserves special mention. It is the open dataset that multiple AI training pipelines — including earlier versions of GPT models — have drawn from heavily. Blocking CCBot does not just affect one platform; it affects any AI system that references Common Crawl data.
How to Check Your Current robots.txt (3 Methods)
Method 1: Visit Your robots.txt Directly
Navigate to https://yourdomain.com/robots.txt in your browser. This file must be publicly accessible at exactly this path. Look for any of the following patterns that indicate a problem:
- `User-agent: *` followed by `Disallow: /`, which blocks everything for all crawlers
- `User-agent: *` followed by `Disallow: /` plus `Allow:` rules for specific bots only (if AI crawlers are not listed, they are blocked)
- No mention of `GPTBot`, `ClaudeBot`, `PerplexityBot`, or `Google-Extended` anywhere in the file
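The manual check above can be automated. Here is a rough sketch in Python that scans a robots.txt file for exactly these red-flag patterns; the sample text stands in for a file you would fetch from your own domain:

```python
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]

def audit_robots(text: str) -> list[str]:
    """Return warnings for the misconfiguration patterns described above."""
    warnings = []
    current_agents: list[str] = []   # user-agents of the group being parsed
    group_has_rules = False
    mentioned: set[str] = set()
    wildcard_blocks_all = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if group_has_rules:               # a rule line closed the last group
                current_agents, group_has_rules = [], False
            current_agents.append(value)
            mentioned.add(value)
        elif field in ("allow", "disallow"):
            group_has_rules = True
            if field == "disallow" and value == "/" and "*" in current_agents:
                wildcard_blocks_all = True
    if wildcard_blocks_all:
        warnings.append("wildcard group disallows / (blocks all unlisted crawlers)")
    warnings += [f"{bot} never mentioned (wildcard rules apply to it)"
                 for bot in AI_BOTS if bot not in mentioned]
    return warnings

sample = "User-agent: *\nDisallow: /\n"      # the staging-template pattern
for warning in audit_robots(sample):
    print("WARNING:", warning)
```

This is a heuristic scan, not a full robots.txt parser, but it is enough to surface the staging-template mistake and any AI crawler the file never names.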
Method 2: The Google Search Console robots.txt Report
In Google Search Console, open Settings and find the robots.txt report (it replaced the standalone robots.txt Tester, which Google retired in 2023). The report shows the live file Google last fetched, when it was fetched, and any parse errors. Its limitation for our purposes: it only reflects how Google's own crawlers read the file, and it cannot simulate GPTBot, ClaudeBot, or PerplexityBot. Use it to confirm the file is reachable and parses cleanly, then check the AI crawler rules by reading the file directly.
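For non-Google user-agents, the same kind of test can be run locally with Python's standard-library robots.txt parser, which accepts any user-agent string. A sketch, with placeholder robots.txt content and URLs; in practice you would fetch your live file first:

```python
from urllib.robotparser import RobotFileParser

CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]

# Placeholder: the staging-template pattern. Substitute your real file.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for bot in CRAWLERS:
    for url in ("https://yourdomain.com/", "https://yourdomain.com/blog/"):
        verdict = "allowed" if parser.can_fetch(bot, url) else "BLOCKED"
        print(f"{bot:16} {url:35} {verdict}")
```

One caveat: `RobotFileParser` applies the rules inside a group in file order rather than by the longest-match rule Google uses, so results can differ for groups that mix `Allow` and `Disallow`. For simple allowed-or-blocked checks like this one, it agrees with the major crawlers.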
Method 3: Run a Free AEO Audit
Both methods above require manual inspection and interpretation. The fastest option is to run your domain through our free AEO audit at aeoauditool.com. It automatically fetches your robots.txt, parses the rules, and checks all five major AI crawlers against your current configuration in about 30 seconds. You get a clear pass/fail result for each crawler with no ambiguity.
The Correct robots.txt Configuration for AI Crawlers
If your audit reveals that AI crawlers are currently blocked, here is the exact syntax to fix it. Add these rules to your robots.txt. Order does not matter to a compliant parser: a crawler obeys the group that most specifically matches its user-agent, wherever that group sits in the file. Placing the specific groups above a blanket User-agent: * / Disallow: / rule simply makes the intent obvious to human readers:
# Allow all AI crawlers (add this if they're currently blocked)
User-agent: GPTBot
Allow: /
# Optional: GPTBot gathers training data, so paths disallowed here stay out
# of training crawls. Delete or edit this line if you don't need it.
Disallow: /private/

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /
A few important notes on this configuration:
Specificity wins. When a crawler matches both a specific user-agent group and the wildcard group, the specific group takes precedence no matter where each appears in the file. A User-agent: GPTBot / Allow: / group therefore guarantees GPTBot is permitted even alongside a User-agent: * / Disallow: / rule.
Each crawler needs its own rule group. You cannot name several bots on a single User-agent: line. You can, however, stack multiple User-agent: lines above one shared set of Allow: and Disallow: rules; the one-group-per-bot layout shown above is simply the easiest to maintain.
CCBot is typically fine to allow. It is a legitimate academic and research crawler operated by the Common Crawl Foundation. If you have concerns about training data specifically, use the optional disallow pattern shown above for the sections you want to protect.
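Before deploying the corrected file, it is worth sanity-checking that the specific groups really do win over the wildcard block. A sketch using Python's standard-library parser, with placeholder paths; the Disallow line sits above the catch-all Allow because `RobotFileParser` applies a group's rules in file order rather than by longest match:

```python
from urllib.robotparser import RobotFileParser

# A corrected file: specific AI-crawler groups plus the original wildcard block.
fixed = """\
User-agent: GPTBot
Disallow: /private/
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(fixed.splitlines())

checks = [
    ("GPTBot", "https://yourdomain.com/", True),                  # specific group wins
    ("GPTBot", "https://yourdomain.com/private/report", False),   # carve-out honored
    ("ClaudeBot", "https://yourdomain.com/pricing", True),
    ("RandomScraper", "https://yourdomain.com/", False),          # wildcard still blocks
]
for agent, url, expected in checks:
    assert parser.can_fetch(agent, url) == expected, (agent, url)
print("all crawler checks passed")
```

Note how the unnamed crawler is still caught by the wildcard: allowing the five AI bots does not open the site to everything else.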
Should You Block AI Crawlers for Training vs. Crawling?
This is the nuance that most guides skip. There are actually two separate things you might want to control:
- AI training data collection — whether your content is used to train the underlying model
- AI search crawling — whether your content can be retrieved and cited in AI-generated answers
Blocking all of a platform's crawlers prevents both. But if your goal is to appear in AI search results and citations while opting out of future training datasets, a more targeted approach is sometimes possible.
Currently, OpenAI documents this distinction most clearly: GPTBot gathers content for model training, while a separate crawler, OAI-SearchBot, powers ChatGPT search and retrieval. Disallowing GPTBot while allowing OAI-SearchBot opts you out of training without giving up ChatGPT search visibility. Anthropic and Perplexity publish less granular per-use-case controls, which makes blocking their crawlers closer to an all-or-nothing decision; check each company's crawler documentation before deciding, as this area changes quickly.
For most B2B sites, the recommendation is to allow all five AI crawlers without restriction. The competitive cost of being excluded from AI search results is far higher than the marginal risk of training data use — especially given that your content is likely already publicly indexed and accessible through other means.
If you have genuinely sensitive content (unreleased research, proprietary methodology, client data that should not be public), the right fix is access controls at the application level, not robots.txt. Robots.txt is a request, not a technical barrier — any crawler that wants to ignore it can.
Frequently Asked Questions
Which AI crawlers should I allow?
At minimum, allow GPTBot (ChatGPT), PerplexityBot (Perplexity AI), ClaudeBot (Anthropic/Claude), and Google-Extended (Google's Gemini models). These four cover the AI platforms your B2B prospects are most likely using for research. Adding CCBot covers the Common Crawl dataset used by multiple additional AI systems. Unless you have a specific legal or competitive reason to block them, allowing all five is the right default.
Can I block AI crawlers from training on my content but still appear in AI search results?
Yes, for OpenAI, which uses separate user-agents for separate purposes: disallow GPTBot (the training crawler) and allow OAI-SearchBot (the ChatGPT search crawler), and you opt out of future training datasets without removing yourself from ChatGPT search results. You can also leave GPTBot allowed at the root and add Disallow: directives for specific paths you want excluded from training data. For ClaudeBot and PerplexityBot, no equivalent split is as clearly documented, so blocking those crawlers on a path generally prevents both training and citation retrieval from it. Monitor each platform's developer documentation, as this area is evolving quickly.
How quickly does allowing AI crawlers affect my visibility?
GPTBot typically returns to high-value pages (homepages, product pages, and content with existing authority signals) within days of being unblocked. However, appearing in ChatGPT or Perplexity results depends on more than crawl access. Your content needs to be clearly structured, answer specific questions well, and carry sufficient topical authority. Unblocking crawlers is the prerequisite; optimizing your content for AI retrieval is the ongoing work. Expect to see changes in AI citation monitoring tools within two to four weeks of making the robots.txt change.
Does blocking Google-Extended affect regular Google rankings?
No. Google-Extended is not a separate crawler at all; it is a robots.txt control token that Googlebot reads and honors. Blocking it tells Google not to use your content for training or grounding its Gemini models, and it has no effect on regular organic search crawling, indexing, or ranking. Note also that AI Overviews are a Google Search feature served by Googlebot and governed by the standard Search controls, so blocking Google-Extended does not remove you from AI Overviews. You can block it while keeping your full organic search presence intact.
Run a free audit at aeoauditool.com to see all 5 AI crawler checks for your domain — it takes 30 seconds and shows you exactly which bots are blocked and what to change.