Lesson 04·GEO Foundations·~9 min·reach → audit
The access audit
You made your passages citable (Lesson 2) and built off-site consensus (Lesson 3). None of it counts if the engine can't fetch and read the page. This is the plumbing check — the one most sites silently fail.
A robots.txt rule, a page that only renders after JavaScript, a CDN challenge: any one of them makes a perfect passage invisible. This is the 15-minute audit that finds the silent blocks before they cost you citations.
In ~9 minutes you'll be able to
- Name the AI crawlers that matter — and tell a search bot from a training bot.
- Read your robots.txt the way an engine does, and spot an accidental block.
- Run a repeatable reachability audit: robots, render, status, walls.
The whole lesson in one block:
An AI engine can only cite a page it can actually fetch and read. Three gates quietly lock it out: a robots.txt rule that blocks an AI crawler (often by accident), content that only appears after JavaScript runs (most AI crawlers don't execute JS), and broken status or login walls. The access audit is a 15-minute check of those gates. The rule that ties it to everything else: if a passage isn't in the server-rendered HTML of a URL an AI bot is allowed to fetch, it cannot be cited — however good it is.
01 Meet the crawlers (who's actually knocking)
"AI crawler" isn't one thing. The bots fall into three jobs, and the difference decides what blocking them costs you. The names below are the official user-agents the engines publish.[1]
Search & answer crawlers · block these = invisible
These fetch pages so an engine can retrieve and cite them in a live answer: OAI-SearchBot (OpenAI), Claude-SearchBot (Anthropic), and PerplexityBot (Perplexity). This is the GEO-critical group; block one and you remove yourself from that engine's answers. (Google AI Overviews and Gemini ride on Googlebot; Bing/Copilot on Bingbot.)[1]
Training crawlers · blocking = a values choice
These gather text to train future models, for example GPTBot (OpenAI) and ClaudeBot (Anthropic). Blocking them keeps your content out of training sets but does not hurt your visibility in AI answers. Google-Extended and Applebot-Extended are opt-out tokens that don't touch Search at all.[2] Deciding to block training is legitimate; just don't confuse it with the group above.
User-triggered fetchers · mostly ignore robots.txt
These fire when a person asks the assistant to open a specific link: ChatGPT-User, Perplexity-User, and Claude-User. Several vendors state that robots.txt "may not apply" to a user-initiated fetch, so robots rules are only a partial control. (Anthropic says Claude-User honors robots by policy; OpenAI and Perplexity are explicit that theirs may not.)[3]
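To make the three jobs concrete, here is a minimal Python sketch that sorts a request's User-Agent header into one of the groups. The bot tokens come from the vendor docs cited in [1]; the `CRAWLER_ROLES` table and the `classify` helper are illustrative names, not part of any library, and the list will go stale as engines add or rename bots. Google-Extended and Applebot-Extended are left out on purpose: they are robots.txt tokens, not user-agents you would see in a request log.

```python
# Illustrative mapping of user-agent tokens to the three crawler jobs.
# Assumption: matching on a lowercase substring is good enough for log triage.
CRAWLER_ROLES = {
    # Search & answer crawlers: blocking these removes you from answers.
    "oai-searchbot": "search",
    "claude-searchbot": "search",
    "perplexitybot": "search",
    "googlebot": "search",
    "bingbot": "search",
    # Training crawlers: blocking these is a values choice.
    "gptbot": "training",
    "claudebot": "training",
    # User-triggered fetchers: robots.txt is only a partial control.
    "chatgpt-user": "user",
    "claude-user": "user",
    "perplexity-user": "user",
}


def classify(user_agent: str) -> str:
    """Return 'search', 'training', 'user', or 'unknown' for a User-Agent header."""
    ua = user_agent.lower()
    for token, role in CRAWLER_ROLES.items():
        if token in ua:
            return role
    return "unknown"


# Shortened stand-in header, not a real vendor string.
print(classify("Mozilla/5.0 (compatible; GPTBot/1.2)"))
```

Running this over a day of access logs gives a rough picture of which group is actually knocking before you decide what to block.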
02 robots.txt: the one file that can erase you
Well-behaved crawlers check yourdomain.com/robots.txt before fetching anything else. It's a plain-text list of requests (advisory, not enforced), but the search bots that matter obey it. That makes it the single highest-leverage file for GEO, and the easiest place to delete yourself by accident. Three classic mistakes:
Mistake 1 — a blanket block left on a live site. A staging rule that shipped to production:
✕ Erases you from AI search
# left over from staging
User-agent: *
Disallow: /
One slash. This tells every crawler — Google, Bing, and every AI search bot — to skip the whole site. You won't get an error; you'll just quietly vanish.
✓ Open to answers, training your call
# allow AI search; opt out of training
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
Search/answer bots stay welcome; only the training crawlers are turned away. Separate the two decisions on purpose.
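One way to read a robots.txt "the way an engine does" is to feed it to a parser and ask, bot by bot, whether a URL is allowed. The sketch below uses Python's standard-library `urllib.robotparser`; the `allowed_agents` helper and the `https://example.com/guide` URL are placeholders for illustration. Python's parser is one implementation among several, so treat the result as a sanity check rather than a guarantee of how each engine resolves edge cases.

```python
from urllib.robotparser import RobotFileParser

# The "open to answers, training your call" file from above.
ROBOTS_TXT = """\
# allow AI search; opt out of training
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/guide"):
    """Return {agent: True/False} for whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


bots = ["OAI-SearchBot", "PerplexityBot", "GPTBot", "Google-Extended"]
for agent, ok in allowed_agents(ROBOTS_TXT, bots).items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Swapping in the two-line staging file (`User-agent: *` followed by `Disallow: /`) makes every bot in the list come back blocked, which is the accidental-erase case from Mistake 1.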
Mistake 2 — blocking the bot before it ever reads robots.txt. If your CDN or firewall (Cloudflare "block AI bots," a WAF rule, Bot Fight Mode) challenges or blocks the crawler's request, it never even sees your rules — the page is just gone from that engine. Most owners don't know it's on.[4]
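A rough way to spot this kind of block is to request the same URL twice, once with a browser-like User-Agent and once with a crawler-like one, and compare the responses. The Python sketch below is only a hint, under stated assumptions: the status codes in `BLOCK_STATUSES` and the phrases in `CHALLENGE_MARKERS` are guesses at what a challenge page tends to look like, the helper names are hypothetical, and a spoofed header sent from your own machine is not the real bot (CDNs may also verify source IPs). Confirm anything it flags in your CDN or WAF dashboard.

```python
import urllib.error
import urllib.request

# Assumption: these statuses usually mean "turned away", not "served".
BLOCK_STATUSES = {401, 403, 429, 503}
# Assumption: phrases that often appear on interstitial challenge pages.
CHALLENGE_MARKERS = ("just a moment", "verify you are human", "attention required")


def looks_blocked(status: int, body: str) -> bool:
    """Heuristic: does this response look like a block or challenge page?"""
    if status in BLOCK_STATUSES:
        return True
    text = body.lower()
    return any(marker in text for marker in CHALLENGE_MARKERS)


def fetch_as(url: str, user_agent: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch `url` with the given User-Agent; return (status, body text)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status and body worth inspecting.
        return err.code, err.read().decode("utf-8", errors="replace")
```

A typical use would be calling `fetch_as` on your own page with a shortened stand-in header such as `"Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"` and passing the result to `looks_blocked`. If the browser-like request passes and the crawler-like one does not, a bot rule is probably switched on somewhere in front of the site.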
Mistake 3: confusing training with search. Owners add a Disallow rule for the Google-Extended user-agent expecting to drop out of AI Overviews. It does nothing of the sort; AI Overviews are governed by normal Search, not that token.[2] The reverse error is worse: a blanket "block all AI" reflex cuts your answer visibility when you only meant to block training.
03 The JavaScript trap
Here's the one that catches modern sites. Most AI crawlers don't run JavaScript. They fetch your HTML, read it, and leave. When Vercel logged real crawler behavior, OpenAI's, Anthropic's, and Perplexity's bots fetched pages but never executed the JavaScript — a page whose content is rendered in the browser appeared blank to them.[5] Only Googlebot, Bingbot, and Applebot reliably render JS.
So a React/Vue/SPA page that loads its text client-side can rank fine in Google (which renders) and be invisible to Perplexity and ChatGPT (which don't). The fix is server-side rendering (SSR/SSG): the words must be in the HTML the server sends, before any script runs.
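The check itself is simple: take the HTML exactly as the server sends it and see whether your key sentence is already in the visible text. The Python sketch below does that with the standard-library `html.parser`, ignoring anything inside `<script>` and `<style>` so that text embedded in a JavaScript bundle does not count. The class and function names are illustrative, and the two HTML strings are made-up examples of a server-rendered page and a client-rendered shell.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def phrase_in_raw_html(html: str, phrase: str) -> bool:
    """True if `phrase` appears in the visible text of the un-rendered HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return " ".join(phrase.split()).lower() in text


# Made-up examples: the same sentence, server-rendered vs. client-rendered.
ssr_page = "<html><body><h1>Access audit</h1><p>Most AI crawlers do not run JavaScript.</p></body></html>"
spa_shell = '<html><body><div id="root"></div><script>render("Most AI crawlers do not run JavaScript.")</script></body></html>'

print(phrase_in_raw_html(ssr_page, "do not run JavaScript"))
print(phrase_in_raw_html(spa_shell, "do not run JavaScript"))
```

The same test works by hand: open "view source" (not the element inspector, which shows the rendered DOM) and search for the sentence you want cited.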
04 The Access Checklist: can AI see you?
Run this on one important URL. Confirm each gate in turn: robots.txt allows the AI search bots, the content is in the server-rendered HTML, the URL returns a normal 200 status, and no login wall or CDN challenge sits in front of it. It's the technical sibling of Lesson 3's Consensus Builder.
Tip: "server-rendered" and "robots allows" carry most of the weight. If either one fails, reachability collapses no matter what else passes.
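To keep the audit repeatable, the gates can be recorded in a small structure and run in the same order every time. The Python sketch below is one possible shape, not a standard tool: the `Gate` dataclass, the `status_gate` mapping, and the choice to fold login walls into the status gate are all assumptions made for illustration. The first two inputs are booleans you would obtain from a robots.txt check and a raw-HTML check of your own.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    name: str
    passed: bool
    note: str


def status_gate(status: int) -> Gate:
    """Map an HTTP status code onto the status/walls part of the audit."""
    if status == 200:
        return Gate("status", True, "page served normally")
    if status in (401, 403):
        return Gate("status", False, "login wall or access block")
    if 300 <= status < 400:
        return Gate("status", False, "redirect: audit the destination URL instead")
    if status in (404, 410):
        return Gate("status", False, "page is missing")
    return Gate("status", False, f"unexpected status {status}")


def audit(robots_allows: bool, text_in_raw_html: bool, status: int) -> list[Gate]:
    """Run the gates in a fixed order: robots, render, then status/walls."""
    return [
        Gate("robots", robots_allows, "robots.txt allows the AI search bots"),
        Gate("render", text_in_raw_html, "key passage is in server-rendered HTML"),
        status_gate(status),
    ]


def failures(gates: list[Gate]) -> list[str]:
    """Names of the gates that did not pass, in audit order."""
    return [gate.name for gate in gates if not gate.passed]


# Example: robots is fine and the page returns 200, but content is client-rendered.
print(failures(audit(robots_allows=True, text_in_raw_html=False, status=200)))
```

An empty list means the door is open for that URL; anything else names the gate to fix first.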
05 Check your understanding
3 quick checks
Q1. Your page ranks well on Google but never appears in Perplexity or ChatGPT answers. Content is rendered client-side in React. Most likely cause?
Q2. You want out of AI training sets but you still want to show up in AI answers. What do you do?
Q3. Fastest way to check whether an AI crawler can actually see a given page's content?
06 The honest caveat
What access can and can't do
Access is necessary, not sufficient. Passing this audit doesn't earn you a citation — it earns you the right to compete for one. Everything from Lessons 1–3 still has to be true: an extractable passage, real authority, off-site consensus. The audit just makes sure the door is open.
robots.txt is a request, not a wall, and it only governs polite bots. User-triggered fetchers may ignore it, and rules can take a day or more to be honored after you change them.[3] Don't treat it as security. And skip the hype around llms.txt: as of 2026 no major engine relies on it — Google has said plainly it doesn't use it — so it's optional future-proofing, not a citation lever.[6] The crawler landscape also shifts fast; re-run this audit when you relaunch or replatform.
Sources
[1] Official crawler docs: OpenAI (GPTBot, OAI-SearchBot, ChatGPT-User); Anthropic (ClaudeBot, Claude-SearchBot, Claude-User); Perplexity (PerplexityBot, Perplexity-User); Google (Googlebot and crawler list). Each engine publishes the user-agent and its purpose (training vs search vs user-triggered).
[2] Google, Google-Extended: a robots.txt opt-out token controlling Gemini training/grounding only; it does not affect inclusion in Search or AI Overviews. Apple's Applebot-Extended is the equivalent training-only opt-out. Training and search are separately controllable.
[3] OpenAI and Perplexity state that user-initiated fetches (ChatGPT-User, Perplexity-User) may not apply robots.txt; Anthropic states Claude-User honors it by policy. Vendor bot docs (OpenAI, Perplexity, Anthropic), 2026. robots.txt is advisory and can take ~24h+ to take effect.
[4] Cloudflare, "From Googlebot to GPTBot: who's crawling your site" (2025): only ~14% of domains with a robots.txt set any explicit AI-bot directive; >90% still allow all AI crawlers; GPTBot is the most-blocked. CDN/WAF challenges can block a crawler before it reads robots.txt.
[5] Vercel, "The rise of the AI crawler" (2025), log study: OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot and others fetch HTML but do not execute JavaScript; only Googlebot, Bingbot, and Applebot render. Client-only content appears blank to non-rendering crawlers, so server-render it.
[6] On llms.txt adoption/effect: SE Ranking, "Does llms.txt actually work?" (2026, ~300k-domain study): ~10% adoption; no measurable lift in AI citations. Google's Gary Illyes (reported, mid-2025) said Google does not use llms.txt. Treat it as optional, not a ranking factor.
— end of lesson 4 —