Lesson 4 of 5

Lesson 04 · GEO Foundations · ~9 min · reach → audit

The access audit

You made your passages citable (Lesson 2) and built off-site consensus (Lesson 3). None of it counts if the engine can't fetch and read the page. This is the plumbing check — the one most sites silently fail.

🎯 Why this is Lesson 4. Everything so far assumed the engine can reach your words. Often it can't: a stray robots.txt rule, a page that only renders after JavaScript, a CDN challenge — any one of them makes a perfect passage invisible. This is the 15-minute audit that finds the silent blocks before they cost you citations.

In ~9 minutes you'll be able to

  • Name the AI crawlers that matter — and tell a search bot from a training bot.
  • Read your robots.txt the way an engine does, and spot an accidental block.
  • Run a repeatable reachability audit: robots, render, status, walls.


The whole lesson in one block:

The answer, up front

An AI engine can only cite a page it can actually fetch and read. Three gates quietly lock it out: a robots.txt rule that blocks an AI crawler (often by accident), content that only appears after JavaScript runs (most AI crawlers don't execute JS), and broken status or login walls. The access audit is a 15-minute check of those gates. The rule that ties it to everything else: if a passage isn't in the server-rendered HTML of a URL an AI bot is allowed to fetch, it cannot be cited — however good it is.

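01 Meet the crawlers (who's actually knocking)

"AI crawler" isn't one thing. The bots fall into three jobs: search/answer crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot), whose access decides whether you appear in that engine's answers; training crawlers; and user-triggered fetchers. The difference decides what blocking them costs you. The names below are the official user-agents the engines publish.[1]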

Training crawlers · blocking = a values choice

GPTBot · ClaudeBot · CCBot · Google-Extended · Applebot-Extended

These gather text to train future models. Blocking them keeps your content out of training sets but does not hurt your visibility in AI answers: Google-Extended and Applebot-Extended are opt-out tokens that don't touch Search at all.[2] Deciding to block training is legitimate; just don't confuse it with blocking the search/answer crawlers, which does remove you from that engine's answers.

User-triggered fetchers · mostly ignore robots.txt

ChatGPT-User · Perplexity-User · Claude-User

These fire when a person asks the assistant to open a specific link. Several state that robots.txt "may not apply" to a user-initiated fetch — so robots rules are only a partial control. (Anthropic says Claude-User honors robots by policy; OpenAI and Perplexity are explicit that theirs may not.)[3]
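The three jobs can be sketched as a small lookup, which is handy when skimming server logs. This is a minimal illustration, not a vendor API: the token list is only the one named in this lesson, `BotRole` and `classify_bot` are hypothetical names, and the substring match is an assumption about how the tokens show up in a user-agent string. Google-Extended and Applebot-Extended are robots.txt opt-out tokens, so expect to see them in rules rather than in request logs.

```python
from enum import Enum


class BotRole(Enum):
    SEARCH = "search/answer crawler"
    TRAINING = "training crawler"
    USER_TRIGGERED = "user-triggered fetcher"
    UNKNOWN = "unknown"


# Tokens exactly as named in this lesson; vendors may add or rename bots.
KNOWN_BOTS = {
    "OAI-SearchBot": BotRole.SEARCH,
    "Claude-SearchBot": BotRole.SEARCH,
    "PerplexityBot": BotRole.SEARCH,
    "GPTBot": BotRole.TRAINING,
    "ClaudeBot": BotRole.TRAINING,
    "CCBot": BotRole.TRAINING,
    "Google-Extended": BotRole.TRAINING,
    "Applebot-Extended": BotRole.TRAINING,
    "ChatGPT-User": BotRole.USER_TRIGGERED,
    "Perplexity-User": BotRole.USER_TRIGGERED,
    "Claude-User": BotRole.USER_TRIGGERED,
}


def classify_bot(user_agent: str) -> BotRole:
    """Return the role of the first known token found in a user-agent string."""
    ua = user_agent.lower()
    for token, role in KNOWN_BOTS.items():
        if token.lower() in ua:
            return role
    return BotRole.UNKNOWN
```

A matching user-agent string is only a claim; anyone can send one, so treat the result as a hint about intent rather than proof of identity.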

02 robots.txt — the one file that can erase you

Every crawler checks yourdomain.com/robots.txt first. It's a plain-text list of requests — advisory, not enforced — but the well-behaved search bots obey it. That makes it the single highest-leverage file for GEO, and the easiest place to delete yourself by accident. Three classic mistakes:

Mistake 1 — a blanket block left on a live site. A staging rule that shipped to production:

✕ Erases you from AI search

# left over from staging
User-agent: *
Disallow: /

One slash. This tells every crawler — Google, Bing, and every AI search bot — to skip the whole site. You won't get an error; you'll just quietly vanish.

✓ Open to answers, training your call

# allow AI search; opt out of training
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /

Search/answer bots stay welcome; only the training crawlers are turned away. Separate the two decisions on purpose.
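To read a robots.txt "the way an engine does," one option is Python's standard-library `urllib.robotparser`. The sketch below feeds it the two rule sets shown above and asks which bots may fetch the home page. Assumptions to keep in mind: `allowed_bots` and the constant names are hypothetical, `https://example.com/` is a placeholder URL, and Python's parser is one implementation among many, so a real crawler may resolve edge cases (such as overlapping Allow/Disallow paths) differently.

```python
from urllib.robotparser import RobotFileParser

# Bot tokens as named in this lesson.
AI_SEARCH_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "Google-Extended"]

STAGING_LEFTOVER = """\
# left over from staging
User-agent: *
Disallow: /
"""

SPLIT_DECISIONS = """\
# allow AI search; opt out of training
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
"""


def allowed_bots(robots_txt: str, bots: list[str],
                 url: str = "https://example.com/") -> dict[str, bool]:
    """Map each bot token to whether these rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    for label, rules in [("staging leftover", STAGING_LEFTOVER),
                         ("split decisions", SPLIT_DECISIONS)]:
        print(label, allowed_bots(rules, AI_SEARCH_BOTS + TRAINING_BOTS))
```

Pasting your own live robots.txt into a string and running the same call is a quick way to spot an accidental block before a crawler does.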

Mistake 2 — blocking the bot before it ever reads robots.txt. If your CDN or firewall (Cloudflare "block AI bots," a WAF rule, Bot Fight Mode) challenges or blocks the crawler's request, it never even sees your rules — the page is just gone from that engine. Most owners don't know it's on.[4]
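One rough way to probe for this kind of edge block is to request the same URL twice, once with a browser-like user-agent and once with a crawler's token, and compare the status codes. The sketch below is a heuristic under stated assumptions: `status_for` and `edge_block_suspected` are hypothetical helpers, the set of "refusal" codes is a guess at how a CDN or WAF typically answers a blocked bot, and a request from your own machine may be treated differently from the real crawler, since many CDNs also verify bots by IP.

```python
import urllib.error
import urllib.request


def status_for(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch `url` with the given User-Agent header and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return error.code


def edge_block_suspected(browser_status: int, bot_status: int) -> bool:
    """True when a browser-like request succeeds but the bot request is refused.

    The refusal codes are an assumption, not a documented CDN contract.
    """
    browser_ok = 200 <= browser_status < 300
    bot_refused = bot_status in {401, 403, 429, 503}
    return browser_ok and bot_refused
```

A `True` result is a reason to open your CDN or firewall dashboard and look for a bot-blocking toggle, not a confirmed diagnosis.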

Mistake 3: confusing training with search. Owners disallow Google-Extended (User-agent: Google-Extended, then Disallow: /) expecting to drop out of AI Overviews. It does nothing of the sort: AI Overviews are governed by normal Search, not that token.[2] A blanket "block all AI" reflex cuts your answer visibility when you only meant to block training.

Reality check: Most sites don't block AI search bots on purpose. Cloudflare found that only ~14% of domains with a robots.txt set any explicit AI-bot rule, and over 90% still allow them all.[4] The danger isn't the deliberate block. It's the accidental one you never notice.

03 The JavaScript trap

Here's the one that catches modern sites. Most AI crawlers don't run JavaScript. They fetch your HTML, read it, and leave. When Vercel logged real crawler behavior, OpenAI's, Anthropic's, and Perplexity's bots fetched pages but never executed the JavaScript — a page whose content is rendered in the browser appeared blank to them.[5] Only Googlebot, Bingbot, and Applebot reliably render JS.

So a React/Vue/SPA page that loads its text client-side can rank fine in Google (which renders) and be invisible to Perplexity and ChatGPT (which don't). The fix is server-side rendering (SSR/SSG): the words must be in the HTML the server sends, before any script runs.

10-second test: Open your page, right-click → View Page Source (not "Inspect"), and Ctrl/Cmd-F for a sentence you can see on screen. If it's not in the raw source, most AI crawlers can't see it either.
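The same test can be scripted for pages you check often. The sketch below takes raw HTML (what View Page Source shows), drops anything inside `<script>` and `<style>`, and looks for the sentence in what remains. It is a minimal approximation, not how any particular crawler extracts text: `sentence_in_raw_html` is a hypothetical helper, and the whitespace normalization can miss text that is split mid-word across tags.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def sentence_in_raw_html(raw_html: str, sentence: str) -> bool:
    """True if `sentence` appears in the markup's text without running any JS."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    # Collapse runs of whitespace so line breaks in the markup don't matter.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    needle = " ".join(sentence.split()).lower()
    return needle in text


# Illustrative inputs: a server-rendered page and a client-rendered shell.
SSR_PAGE = "<html><body><main><p>An AI engine can only cite a page it can fetch.</p></main></body></html>"
SPA_SHELL = '<html><body><div id="root"></div><script>render("An AI engine can only cite a page it can fetch.")</script></body></html>'
```

On the two sample pages, the sentence is found in the server-rendered markup and missed in the shell, where it exists only as a string inside a script.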

04 The Access Checklist — can AI see you?

Run this on one important URL. Toggle on each thing you can confirm is true, and watch how reachable that page is to an AI engine. It's the technical sibling of Lesson 3's Consensus Builder.

Interactive · your feedback loop

Access Checklist

Click each gate you can confirm is open. This is a reachability simulator showing relative leverage — not a measured crawl.

Reachable to AI (illustrative): Invisible · 8/100

Start with a live domain that resolves: that's the 8 you begin with. Now open the gates an engine actually checks.

  • robots.txt allows AI search bots. The biggest single lever. If OAI-SearchBot, Claude-SearchBot, or PerplexityBot is disallowed, you're simply not in that engine's answers, no matter what else is right.
  • Server-rendered content. Most AI crawlers don't run JavaScript. If your words only appear after the browser executes JS, the bot sees a blank page. SSR/SSG puts the text in the HTML.
  • Clean 200. Soft-404s, redirect chains, and 5xx errors drop you from the candidate set before retrieval even starts. One stable URL per passage.
  • No wall. A login, hard paywall, or cookie/age interstitial in front of the content means the crawler gets the wall, not the words.
  • CDN lets them through. "Block AI bots" toggles and bot-fight rules challenge crawlers before they read robots.txt. Allow the search/answer bots explicitly.
  • In the sitemap. A small, real boost to discovery and recrawl freshness: useful, but it can't rescue a page the bot isn't allowed to fetch or can't render.

Tip: turn off "server-rendered" or "robots allows" and watch reachability collapse — those two are most of the score for a reason.
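For a repeatable audit, the checklist can be recorded per URL as plain data. The sketch below is one possible shape, with no scoring: the class, field names, and messages are hypothetical, and treating the first five gates as hard blocks while leaving the sitemap as a non-blocking boost is a simplifying assumption drawn from the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class AccessChecklist:
    robots_allows_ai_search: bool
    server_rendered: bool
    clean_200: bool
    no_wall: bool
    cdn_lets_bots_through: bool
    in_sitemap: bool  # discovery boost only; not treated as a block here


# Assumption: each of these gates makes the page unreachable when closed.
BLOCKING_GATES = [
    ("robots_allows_ai_search", "robots.txt disallows an AI search bot"),
    ("server_rendered", "content only appears after JavaScript runs"),
    ("clean_200", "URL does not return a clean 200"),
    ("no_wall", "login, paywall, or interstitial sits in front of the content"),
    ("cdn_lets_bots_through", "CDN or firewall challenges the crawler"),
]


def closed_gates(check: AccessChecklist) -> list[str]:
    """List the blocking gates that are closed, in checklist order."""
    return [message for field, message in BLOCKING_GATES
            if not getattr(check, field)]


def is_reachable(check: AccessChecklist) -> bool:
    """True only when every blocking gate is open."""
    return not closed_gates(check)
```

Keeping one record per important URL makes it easy to re-run the audit after a relaunch and compare which gates changed.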

05 Check your understanding

3 quick checks

Q1. Your page ranks well on Google but never appears in Perplexity or ChatGPT answers. Content is rendered client-side in React. Most likely cause?

Answer: the content only exists after JavaScript runs. Google renders JS; most AI crawlers don't. Server-render the content so it's in the raw HTML.[5]

Q2. You want out of AI training sets but you still want to show up in AI answers. What do you do?

Answer: block the training crawlers and leave the search/answer bots allowed. Training and search are independently controllable per vendor. Blocking Google-Extended/GPTBot doesn't affect your answer visibility.[2]

Q3. Fastest way to check whether an AI crawler can actually see a given page's content?

Answer: View Page Source and search for a sentence you can see on screen. "Inspect" shows the rendered DOM (post-JavaScript); View Source shows the raw HTML the crawler gets. The gap between them is exactly what hides you.[5]

06 The honest caveat

What access can and can't do

Access is necessary, not sufficient. Passing this audit doesn't earn you a citation — it earns you the right to compete for one. Everything from Lessons 1–3 still has to be true: an extractable passage, real authority, off-site consensus. The audit just makes sure the door is open.

robots.txt is a request, not a wall, and it only governs polite bots. User-triggered fetchers may ignore it, and rules can take a day or more to be honored after you change them.[3] Don't treat it as security. And skip the hype around llms.txt: as of 2026 no major engine relies on it — Google has said plainly it doesn't use it — so it's optional future-proofing, not a citation lever.[6] The crawler landscape also shifts fast; re-run this audit when you relaunch or replatform.

Sources

  1. Official crawler docs: OpenAI GPTBot / OAI-SearchBot / ChatGPT-User; Anthropic ClaudeBot / Claude-SearchBot / Claude-User; Perplexity PerplexityBot / Perplexity-User; Google Googlebot & crawler list. Each engine publishes the user-agent and its purpose (training vs search vs user-triggered).
  2. Google, Google-Extended — a robots.txt opt-out token controlling Gemini training/grounding only; it does not affect inclusion in Search or AI Overviews. Apple's Applebot-Extended is the equivalent training-only opt-out. Training and search are separately controllable.
  3. Vendor bot docs (OpenAI, Perplexity, Anthropic), 2026. OpenAI and Perplexity state that user-initiated fetches (ChatGPT-User, Perplexity-User) may not apply robots.txt; Anthropic states Claude-User honors it by policy. robots.txt is advisory and can take ~24h+ to take effect.
  4. Cloudflare, "From Googlebot to GPTBot — who's crawling your site" (2025) — only ~14% of domains with a robots.txt set any explicit AI-bot directive; >90% still allow all AI crawlers; GPTBot is the most-blocked. CDN/WAF challenges can block a crawler before it reads robots.txt.
  5. Vercel, "The rise of the AI crawler" (2025) — log study: OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot and others fetch HTML but do not execute JavaScript; only Googlebot, Bingbot, and Applebot render. Client-only content appears blank to non-rendering crawlers — server-render it.
  6. On llms.txt adoption/effect: SE Ranking, "Does llms.txt actually work?" (2026, ~300k-domain study) — ~10% adoption; no measurable lift in AI citations. Google's Gary Illyes (reported, mid-2025) said Google does not use llms.txt. Treat it as optional, not a ranking factor.
