Lesson 1 of 5

Lesson 01·GEO Foundations·~8 min·why → tactics

How a generative engine chooses what to cite

Before you change a single word on a page to "win at AI search," you have to see the page the way the engine does. It does not see a website. It sees a pile of passages competing to answer a question.

🎯 Why this is Lesson 1. Your mission needs a repeatable audit, a real grasp of how engines pick sources, and a way to measure it. Every one of those rests on this single mental model. Get the pipeline right and the tactics in later lessons stop feeling like a checklist of superstitions — they become obvious moves to make one stage's job easier.

In ~8 minutes you'll be able to

  • Explain the 4 moves a generative engine makes before it cites anything.
  • See why engines quote passages, not pages — and what that changes.
  • Tell an un-citable paragraph from a citable one at a glance.

Here's the whole lesson in one block — read it, then we'll unpack it:

The answer, up front

A generative engine answers you in four moves: it expands your question into many sub-questions, retrieves a handful of candidate passages for each, reranks and filters them through quality gates, then synthesizes an answer and attaches citations to the specific passages it used. The decisive consequence: engines cite passages, not pages. A page can be fetched and still never quoted — because no paragraph in it could stand on its own.

Notice what just happened: the block above is a canonical answer block, meaning a 40–80 word, front-loaded, self-contained answer. It's the single most "citable" shape a passage can take, and we'll use it constantly. This lesson is written in the form it's teaching.

01 The pipeline, one stage at a time

Modern AI search runs on Retrieval-Augmented Generation (RAG): the model doesn't answer from memory, it answers from documents it fetches at query time. Citations can only come from what was retrieved. So the entire game is getting into — and through — this pipeline.

Understand & expand

Query fan-out

Your one question becomes 8–10 parallel sub-queries. "Best merit aid for engineering majors" silently spawns "average engineering merit scholarship 2026," "which universities stack merit + need aid," "Barrett Honors scholarship amount," and so on. AI-generated sub-queries run longer than human searches — averaging ~5.5 words on ChatGPT and ~9.1 on Gemini, versus ~3.4 for a classic Google query.[2]
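To make fan-out concrete, here is a rough coverage audit: given a fan of sub-queries, which ones does no section of your page plausibly address? This is a minimal sketch, not how any engine works. The term-overlap test, the `min_overlap` threshold, the stopword list, and the section headings are all assumptions made for illustration; only the sub-queries come from the example above.

```python
import re

# Small illustrative stopword list (an assumption, not a standard set).
STOPWORDS = {"a", "an", "the", "in", "for", "of", "and", "to", "which", "is"}

def content_terms(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def uncovered_subqueries(subqueries, sections, min_overlap=0.6):
    """Return sub-queries for which no section shares at least
    `min_overlap` of the sub-query's content terms."""
    section_terms = [content_terms(s) for s in sections]
    gaps = []
    for query in subqueries:
        q_terms = content_terms(query)
        if not q_terms:
            continue
        best = max(
            (len(q_terms & s) / len(q_terms) for s in section_terms),
            default=0.0,
        )
        if best < min_overlap:
            gaps.append(query)
    return gaps

subqueries = [
    "average engineering merit scholarship 2026",
    "which universities stack merit + need aid",
    "Barrett Honors scholarship amount",
]
# Hypothetical section headings from a page being audited.
sections = [
    "Average engineering merit scholarship in 2026",
    "Barrett Honors scholarship amounts",
]
print(uncovered_subqueries(subqueries, sections))
```

Word overlap is only a crude proxy for "genuinely answered". Treat a flagged sub-query as a prompt to go read the page, not as a verdict.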

What this means for you: You're not optimizing for one keyword. You're trying to show up across a fan of specific, long, related questions, which rewards breadth of genuinely answered sub-topics, not keyword repetition.

Fetch candidates

Hybrid retrieval

For each sub-query the engine pulls a small candidate set (Perplexity: ~5–10 pages) using hybrid retrieval — lexical matching (BM25, the actual words) plus dense vector embeddings (semantic meaning).[1] Pages are chunked and embedded so the engine can grab the right section, not the whole document.
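A toy version of hybrid retrieval shows why both signals matter. The sketch below scores chunks with a small BM25 implementation and with cosine similarity over vectors, then merges the two rankings. Several things here are assumptions: the source does not say how any engine fuses the two signals, so reciprocal rank fusion (with the commonly used constant `k=60`) is swapped in as one plausible choice, and the vectors are hand-written stand-ins for what an embedding model would produce.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every doc against the query (the lexical signal)."""
    toks = [tokenize(d) for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def ranks(scores):
    """Map doc index -> rank (1 = best) for a list of scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {i: r for r, i in enumerate(order, start=1)}

def hybrid_rank(query, docs, query_vec, doc_vecs, k=60):
    """Fuse lexical and semantic rankings with reciprocal rank fusion."""
    lexical = ranks(bm25_scores(query, docs))
    semantic = ranks([cosine(query_vec, v) for v in doc_vecs])
    fused = {i: 1 / (k + lexical[i]) + 1 / (k + semantic[i]) for i in range(len(docs))}
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical chunks and hand-made 2-d "embeddings" for illustration only.
chunks = [
    "Engineering merit scholarship amounts for 2026 admits",
    "Campus dining hours and meal plans",
    "Financial aid for engineering students",
]
chunk_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
print(hybrid_rank("engineering merit scholarship", chunks, [1.0, 0.0], chunk_vecs))
```

The point for your content: a chunk that shares no words with the sub-query can only win on the semantic side, and a chunk that is off-topic in meaning can only win on the lexical side. Clear sections that use the vocabulary of the question help on both.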

What this means for you: You must be reachable and machine-readable: not blocked from AI crawlers, server-rendered (crawlers often don't run JavaScript), and chunked into clearly-bounded sections an embedding can isolate.

Filter hard

Rerank & quality gate

Candidates are re-scored for relevance, quality, and authority. Perplexity runs three reranking layers plus an XGBoost quality gate that checks entity clarity and authoritativeness — content passes through roughly six stages before earning a citation.[1] There's also a strong recency bias: fresher content wins.

What this means for you: Being retrieved ≠ being cited. Clear primary entity, visible authority (named author, references), and a recent "last updated" date are what carry a passage through the gates.

Write & attribute

Synthesize & cite passages

Finally the model writes an answer constrained by the surviving evidence and attaches citations to the specific passages it quoted — evaluating each passage independently for whether it stands alone and can be attributed cleanly. Only part of a page may get cited even when the page was retrieved.[1]

What this means for you: Engineer the paragraph, not just the article. A self-contained, fact-dense, front-loaded passage is the unit that gets cited.

02 The one idea to keep

If you remember nothing else

Engines cite passages, not pages.

Classic SEO got a page to rank. GEO gets a passage quoted. The shift is from "is my page relevant and authoritative overall?" to "can this specific paragraph be lifted out, understood without its neighbors, and attributed in one clean line?" That property has a name, extractability, and it's driven by self-containedness, a length of roughly 40–80 words, a front-loaded answer, named entities, and concrete facts.

To be precise: the engine still retrieves whole pages first, so your page has to be reachable and rank well enough to be pulled in, but what it actually quotes is the passage. Optimize the passage; don't neglect the page.

03 See it: the same fact, un-citable vs. citable

Same claim, two shapes. Read each as if you're the engine deciding whether you can quote it in one line. (Figures below are illustrative.)

✕ Hard to cite

"We help families find the best merit aid out there. Our platform makes it easy to compare your options and save serious money on college, so you can focus on what matters."

No entity (who is "we"?), no number, no source, no standalone claim. Lift it into an answer and it says nothing checkable. Reranker drops it.

✓ Easy to cite

"Arizona State University's Barrett Honors College awards merit scholarships of up to $15,000/year; in 2026 the median merit package for out-of-state admits was about $9,400, according to MeritPlaybook's 2026 aid dataset."

Named entity, specific numbers, a date, an attributed source — and it stands alone. The engine can quote it verbatim and footnote you.
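You can turn that at-a-glance judgment into a crude first-pass check. The sketch below reports a few surface signals for a passage: word count, whether it has a number, something that looks like a named entity, an attribution phrase, and a vague opener. Every rule here is an assumption chosen for illustration (the regexes, the opener list, the attribution phrases). It is not how any engine scores passages, and the entity pattern only catches runs of two or more capitalized words.

```python
import re

# Illustrative lists, not an exhaustive or official set.
VAGUE_OPENERS = {"we", "our", "it", "this", "they", "these"}
ATTRIBUTION = re.compile(r"\b(according to|reported by|source:)", re.IGNORECASE)
# Two or more consecutive capitalized words, e.g. "Arizona State University".
NAMED_ENTITY = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+")

def extractability_signals(passage):
    """Return rough surface signals for how quotable a passage looks."""
    words = passage.split()
    first_word = re.sub(r"\W", "", words[0]).lower() if words else ""
    return {
        "word_count": len(words),  # compare against the 40-80 word target
        "has_number": bool(re.search(r"\d", passage)),
        "has_named_entity": bool(NAMED_ENTITY.search(passage)),
        "has_attribution": bool(ATTRIBUTION.search(passage)),
        "vague_opener": first_word in VAGUE_OPENERS,
    }

hard = (
    "We help families find the best merit aid out there. Our platform makes it "
    "easy to compare your options and save serious money on college, so you can "
    "focus on what matters."
)
easy = (
    "Arizona State University's Barrett Honors College awards merit scholarships "
    "of up to $15,000/year; in 2026 the median merit package for out-of-state "
    "admits was about $9,400, according to MeritPlaybook's 2026 aid dataset."
)
for label, text in (("hard", hard), ("easy", easy)):
    print(label, extractability_signals(text))
```

Run it over every paragraph on a page and read the ones with no number, no entity, and a vague opener first. Those are the likeliest to say nothing checkable once lifted out of context.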

The second version isn't "better writing" in a literary sense; it's more extractable. That distinction is the whole discipline. And it lines up with the only rigorous study we have: in Aggarwal et al.'s "GEO: Generative Engine Optimization", adding attributed quotations lifted visibility by ~41%, while old-school keyword stuffing made things ~10% worse than doing nothing.[3] We'll mine that paper for the full tactic leaderboard in Lesson 2.

04 Check your understanding

4 quick scenarios

Q1. Your page ranks #2 on Google for a query but never appears in Perplexity's answer. What's the most likely GEO explanation?

Right idea: Being retrieved isn't being cited. High organic rank helps but doesn't guarantee citation — only ~38% of AI-Overview citations come from top-10 pages.[4] A clear, self-contained, attributable passage is what carries you through the gates.

Q2. Which change most directly increases a paragraph's extractability?

Yes. Self-contained + front-loaded + named entity + concrete, attributed fact = the citable shape. Keyword stuffing actually reduces visibility in generative engines.[3]

Q3. "Query fan-out" implies which strategy?

Correct. One question fans out into many longer sub-queries.[2] Breadth of genuinely-answered sub-topics beats density of one keyword.

Q4. A client's React site renders its key content only in the browser via JavaScript. Why is that a GEO risk?

Exactly. If the passage isn't in the server-rendered HTML, it may never enter the candidate set — you lose before reranking even begins.[6]
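A quick way to test for this risk is to look at the HTML the server sends before any JavaScript runs and ask whether your key passage is in it. The sketch below does that with Python's standard `html.parser`, ignoring anything inside `<script>` and `<style>`. It assumes you already have the raw HTML as a string (for example from a plain HTTP fetch); the function name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def passage_in_raw_html(raw_html, passage):
    """True if the passage appears in the text of the un-rendered HTML."""
    extractor = VisibleTextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    # Normalize whitespace and case on both sides before comparing.
    page_text = " ".join(" ".join(extractor.parts).split()).lower()
    return " ".join(passage.split()).lower() in page_text

# Hypothetical responses: one client-rendered, one server-rendered.
client_rendered = (
    '<html><body><div id="root"></div>'
    '<script>render("Barrett Honors College awards merit scholarships")</script>'
    "</body></html>"
)
server_rendered = (
    "<html><body><p>Barrett Honors College awards <b>merit scholarships</b> "
    "of up to $15,000/year.</p></body></html>"
)
passage = "Barrett Honors College awards merit scholarships"
print(passage_in_raw_html(client_rendered, passage))
print(passage_in_raw_html(server_rendered, passage))
```

One known limitation: text nodes are joined with spaces, so a word split by an inline tag in the middle would not match. For an audit that is usually acceptable, since a miss just sends you to view the page source by hand.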

05 The honest caveat (so you can advise credibly)

Hold two views at once

A whole industry will sell you a 12-point "GEO checklist." But in May 2026, Google Search Central's official guidance pushed back hard: "optimizing for generative AI search is optimizing for the search experience, and thus still SEO." Google states it does not read llms.txt, does not need you to manually chunk content, and that schema is for rich results — not a magic AI-citation lever.[5]

Both things are true. The pipeline mechanics in this lesson are real (and especially visible on Perplexity/ChatGPT). And most of what makes a passage citable — clarity, authority, structure, facts — is just good SEO done well. The mature position you can defend to a client: GEO is mostly excellent SEO, re-pointed from "rank the page" to "make the passage quotable," with a few genuinely new moves (crawler access for AI bots, off-site consensus signals, AI-visibility measurement). We'll separate myth from method as we go.
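Of those new moves, crawler access is the easiest to check yourself. The sketch below uses Python's standard `urllib.robotparser` to ask whether a robots.txt file blocks a given crawler from a URL. The user-agent tokens listed are an illustrative assumption; confirm the current names in each vendor's own crawler documentation before relying on them. The robots.txt content and URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list of AI crawler user-agent tokens; verify against vendor docs.
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "PerplexityBot"]

def blocked_ai_crawlers(robots_txt, url, agents=AI_CRAWLERS):
    """Return the crawlers from `agents` that robots.txt disallows for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]

# Hypothetical robots.txt that blocks one AI crawler and allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_crawlers(robots_txt, "https://example.com/merit-aid"))
```

This only reads robots.txt rules. Blocks applied at a firewall, CDN, or bot-management layer will not show up here, so a clean result is a necessary check, not a sufficient one.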

Sources

  1. Authority Tech, "How Perplexity Selects Sources" (2026) — hybrid retrieval, 3 reranking layers + XGBoost gate, ~6 stages, passage-level extraction, recency bias.
  2. 85SIXTY, "How AI Query Fan-Out Is Reshaping SEO in 2026" — fan-out into 8–10 sub-queries; ChatGPT ~5.5 / Gemini ~9.1 / classic ~3.4 words.
  3. Aggarwal, Murahari, et al., "GEO: Generative Engine Optimization", KDD 2024 — quotation addition ~+41%; keyword stuffing ~−10% vs baseline.
  4. Search Engine Journal, "Google AI Overview Citations From Top-Ranking Pages Drop Sharply" (2026) — 38% top-10, 31.2% positions 11–100, 31% beyond 100; 54.5% from organically ranking pages.
  5. Google Search Central / blog.google (May 2026), summarized w/ quotes at We The Flywheel — "optimizing for generative AI search is … still SEO"; Google does not read llms.txt.
  6. Momentic, "AI Search Crawlers & Bots" (2026) & OpenAI bots docs — crawlers, JS-rendering limits, ~40% of sites accidentally block an AI crawler.
