ChatGPT Is a Search Engine. Here's How It Works.

Perhaps the real AGI was the friends we made along the way.

David McSweeney

December 19, 2025

When you strip away all the marketing hype, all the vague “we got a glimpse of AGI” tweets, and all the multi-million-dollar VC-funded GEO and AEO tools telling you that “everything has changed”, one thing remains true: for anything remotely complex, and for the questions that actually matter to your business, ChatGPT isn’t a magical wizard. It’s a search engine.

In some ways a very sophisticated one. In others, a very crude one.

It cannot be anything other than a search engine without an unacceptable level of hallucination.

For any remotely complex question, answers simply must be grounded in up-to-date, real world information, and (other than a select “VIP lane” which I’ll cover) that information must be retrieved from the open web.

Turns out that optimizing for this search engine is surprisingly simple. Although it also exposes the huge competitive advantage (other than the obvious capital and compute) that Google have over OpenAI.

Here’s how it works.

ChatGPT behind the scenes
pay no attention to the man behind the curtain

The Beautiful Chaos Behind Your Answer

When you visit ChatGPT, the UI gives the impression that you’re talking to a single all-knowing, multi-talented oracle. At the time of writing, that oracle is called GPT 5.2.

You’re led to believe (by clever UI design and PR spin) that GPT 5.2 is thinking, planning, searching the web, reading tens of thousands of words of content, and slaving away to bring you the information you need. All (for the vast majority of users) for free.

But this is mainly misdirection.

Because, while OpenAI clearly don’t have a problem burning money, they’re not dumb. Inference is expensive. And inference on a huge multi-trillion parameter frontier model like GPT 5.2 is obscenely expensive. It’s also slow. Every token is cash. Every token is time.

GPT 5.2’s actual role is in fact relatively minor: synthesize a small, curated amount of context provided to it, and combine with its vast amounts of knowledge from training to generate an answer to the user’s question.

Other than for very simple questions that can be answered confidently from training data, GPT 5.2 enters late in the game. In fact, it’s the very last player to take the field.

(although, spoiler, it doesn’t know that, and it thinks it did it all)

The pipeline looks like this.

⬥⬥⬥

Stage 1: A tiny model attempts to classify the user’s query

sonic classifier
Sonic Classifier

The model names change over time, but right now, the very first “AI” that sees a given query is a tiny classification model called snc-pg-sw-3cls-ev3.

The snc- prefix here stands for “sonic” as this model is part of ChatGPT’s sonicberry system.

And sonic is an appropriate moniker, as this model is lightning fast. In a few ms (sometimes even less) it will attempt to return three probability scores:

  • no_search_prob
  • complex_search_prob
  • simple_search_prob

The scores sum to 1.

  • No search = the big model (GPT 5.2 - or perhaps not) can answer from its training data. “What is the capital of France?”, “should you glue cheese to a pizza?”.
  • Simple search = can probably be answered in a single search/turn. “Who is the current President of the United States?”
  • Complex search = will likely require multiple searches/turns to answer the question. “Who is the current CEO of Apple, and what age are they?”

Each type has a threshold, which is currently:

  • simple_search_threshold: 0
  • complex_search_threshold: 0.4
  • no_search_threshold: 0.2

Why is simple search lower than no search? For efficiency, because the algorithm will look something like:

python snippet for ChatGPT classification model
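A rough reconstruction of that cascade, using the thresholds above. The function name, the check order, and the fallback behaviour are my guesses from the logged config, not confirmed internals:

```python
# Hypothetical routing logic implied by the observed thresholds.
# Threshold values are from the network logs; everything else is assumed.

NO_SEARCH_THRESHOLD = 0.2
COMPLEX_SEARCH_THRESHOLD = 0.4
SIMPLE_SEARCH_THRESHOLD = 0.0

def route(probs):
    # Classifier failed or timed out (null in the logs): hand off to Thinky.
    if probs is None:
        return "search"
    # Cheapest outcome checked first: skip the grounding pipeline entirely.
    if probs["no_search_prob"] > NO_SEARCH_THRESHOLD:
        return "no_search"
    if probs["complex_search_prob"] > COMPLEX_SEARCH_THRESHOLD:
        return "complex_search"
    # A threshold of 0 makes simple search the catch-all at the bottom.
    if probs["simple_search_prob"] > SIMPLE_SEARCH_THRESHOLD:
        return "simple_search"
    return "search"
```

With this ordering, simple search's threshold of 0 makes sense: it's not a bar to clear, it's the default once the cheaper "no search" exit has failed.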
We can see all this in the network logs:
ChatGPT sonic classifier
ChatGPT sonic classifier model configs

This classifier model appears to be quite brittle and likely has an aggressive, hard timeout. If it doesn’t return its probability scores within that time (perhaps somewhere in the region of 10ms) it’s dumped from the chain. We often see a null return here, and can assume this is due to failure or timeout.

But certainly, snc-pg-sw-3cls-ev3 is going to catch a lot of simple questions that can be answered by the frontier model’s general knowledge, and skip the grounding pipeline.

So to clarify:

If no_search_prob returns with a confidence score above 0.2, we’ll be straight to GPT 5.2 for the final generation (a quick answer).

If not (either through classification failure, timeout, or a search triggering score), we’ll move to stage 2.

⬥⬥⬥

Stage 2: ‘Thinky’ decides what to search

Alpha Sonic Thinky
Alpha Sonic Thinky: The Real MVP

At this stage GPT 5.2 will go and search the web…

Joking, no it doesn’t.

Enter the second model in ChatGPT’s pipeline, which is currently called alpha.sonic_thinky_v1. From now on we’ll shorten that to ‘Thinky’ to make things simpler (I’m not typing that out every time).

Thinky is likely to be either a distilled version of the frontier model, or a completely separate specialist model, fine-tuned for generating search queries and filtering results. My money is on the second option. Why? Because it has to be fast enough to run iterative loops without annoying the user, and cheap enough to run on every single query.

Either way, this model now takes complete control until it has confidence that there’s enough information for GPT 5.2 to synthesize and generate its final response. The filtering could perhaps be done by another model, but that doesn’t matter too much, and we’ll assume it’s our boy Thinky.

He’s the real MVP, and (aside from the very large programmatic parts of the chain) the one doing the work for you behind the scenes.

Thinky’s first task is to decide whether to use web search and what to search for (remember, we only got here because the classifier didn’t confidently rule out a search). I would surmise that it does not trust the classifier model blindly, and will override it if necessary.

We've seen evidence of this. When the classifier returns null or fails (as it did in our stress tests with ambiguous queries), the system defaults to handing the query to Thinky anyway. Thinky acts as the smarter, slower backup brain.

There are two distinct types of searches it will return for two different purposes (not necessarily at the same time, but simpler to keep it in one step here):

  1. Simple keyword searches for traditional search engines (Google, Bing)
  2. Semantic searches for cosine similarity scoring (and possibly for searching against internal indexes/cache)

The semantic searches are those long, seemingly keyword stuffed searches that everyone in the SEO space has been commenting on.

But they’re not keyword stuffed. They’re structured in a way that tilts the vector towards the intent. I already explained this on my X thread here, so I’ll just quote from there:

Short queries split their "weight" evenly across words.

"lawyer credentials" -> embedding is roughly half lawyer, half credentials.

The "lawyer" bit is dominating.

So what's the solution?

Add synonyms for the angle, but keep the topic mentioned once (or twice). It's still there, but it's diluted, and the embedding is pulled towards the intent.

For example, "lawyer qualifications credentials licensing certifications experience".

Now the embedding is weighted 20% towards "lawyers" and 80% towards the credentials cluster.

The vector points more specifically at what you actually want.
— Me on X
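You can see the dilution effect with a toy mean-pooled embedding. Two made-up dimensions stand in for the thousands a real model has, and the vectors are invented for illustration, not taken from any actual embedding model:

```python
import numpy as np

# Toy 2-D "embeddings": axis 0 = topic (lawyer), axis 1 = intent (credentials).
# These numbers are fabricated for illustration only.
vecs = {
    "lawyer":         np.array([1.0, 0.0]),
    "credentials":    np.array([0.0, 1.0]),
    "qualifications": np.array([0.1, 0.9]),
    "licensing":      np.array([0.0, 1.0]),
    "certifications": np.array([0.1, 0.9]),
    "experience":     np.array([0.2, 0.8]),
}

def embed(query):
    # Crude mean-pooling: every word contributes equal weight to the vector.
    v = np.mean([vecs[w] for w in query.split()], axis=0)
    return v / np.linalg.norm(v)

short = embed("lawyer credentials")
long_ = embed("lawyer qualifications credentials licensing certifications experience")

intent_axis = np.array([0.0, 1.0])
print(short @ intent_axis)   # ~0.71: topic still takes half the weight
print(long_ @ intent_axis)   # ~0.96: synonyms pull the vector towards intent
```

Same topic word, same intent, but the stacked synonyms tilt the pooled vector decisively towards the credentials cluster.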

We end up with simple/semantic pairs:

  • Best running shoes 2025 -> running shoes 2025 top best list awards…
  • Running shoe retailer reviews -> running shoes shop store buy reviews review stars ratings…

As Chris Long recently observed, the semantic queries are currently running to an average of 15 words.

Of course there may be multiple simple/semantic pairs (query fan-out) to ensure there’s enough information for final synthesis. But I would assume that Thinky is instructed to return as few as possible since every search adds latency and compute. For the purposes of this walkthrough, we’ll mainly just focus on one simple search.

There’s no need to dig into the network logs here (although the evidence is there) as most of the industry has been tracking these for some time. What I will say is we don’t necessarily see all the semantic searches, particularly in the final return.

Expected timing/latency: doesn’t even register. 100-200ms max.

Note: I’m going to be providing optimized timings throughout this pipeline. In the real world, depending on network load, CPU/GPU availability, this can all take much longer. We’re likely looking at a task queue hold up and UI theatre on the frontend for a lot of this.
What about complex search?

Remember that complex_search_threshold of 0.4? If Thinky gets triggered with that flag, it doesn't just do one pass. It enters a Recursive Planner Loop.

It searches, reads the result, realizes it needs more info, and searches again. We've seen logs where search_turns_count goes up to 3. This is Thinky doing 'Chain of Thought' via 'Chain of Search.'

But for most queries, the standard single-pass pipeline described below is the workhorse.
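That 'Chain of Search' loop can be sketched in a few lines. The only grounded detail here is the turn cap (search_turns_count reaching 3 in the logs); search_web() and plan_followups() are hypothetical stand-ins for the planner's actual tools:

```python
# Hedged sketch of the Recursive Planner Loop. Only MAX_SEARCH_TURNS is
# grounded in observed logs; the helper functions are assumptions.

MAX_SEARCH_TURNS = 3

def recursive_planner(user_query, search_web, plan_followups):
    context, turns = [], 0
    queries = [user_query]                 # turn 1: plan from the raw query
    while queries and turns < MAX_SEARCH_TURNS:
        turns += 1
        for q in queries:
            context.extend(search_web(q))  # fan out this turn's searches
        queries = plan_followups(context)  # empty list = enough info, stop
    return context, turns
```

Search, read, realize something's missing, search again, until the planner is satisfied or the turn budget runs out.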
⬥⬥⬥

Stage 3: First filtering and selection of candidate pages based on metadata

This stage uses the simple web searches returned by Thinky.

How deep they (OpenAI/ChatGPT) go is speculative. But for argument’s sake we’ll say they search and retrieve the first 5 pages of results for each basic query, giving us 40-50 candidate pages per search.

The searches will all run in parallel. Indeed, the SERP page fetching will all run in parallel.

3 basic search queries x 5 pages = 15 concurrent SERP page fetches (or just 3 if they can get 5 SERP pages in one shot). There’s likely a computationally insignificant deduplication step after fetching, which doesn’t cause a twitch on the CPU.

Note: Reddit’s citation crash after the num=100 removal in September (OpenAI could only fetch 10 results), and subsequent recovery (they figured out how to go deeper) is solid evidence that this part of the chain existed before the enhanced semantic searches.

So what do they do with these results?

Well, other than being in those 50 results, this is the first gate.

For each simple search the retrieved results are passed back into Thinky for filtering, likely with:

  1. The initial user query + the basic search query
  2. Instructions for how to select pages (diversity, known/preferred domains, trust signals etc likely to play a part)
  3. Basic information for each page - title, meta or SERP displayed description, url, possibly (although not necessarily) date, possibly additional SERP display elements (stars, pricing information etc - hello structured data)

Thinky’s one job for this stage: filter this list of 50 candidate pages down to 10-20 for fetching (the more simple searches there are, the fewer candidates per search, most likely).

The pages that weren’t selected are dumped. Gate 1.

  • Expected timing/latency: 500ms - 1.5s to retrieve all search results and deduplicate (parallel fetching), 300-500ms for page selections (parallel)
  • Total time elapsed: under 2 seconds
  • Total candidate pages for each search after this stage: 50 (initial search retrieval) -> 10-20
⬥⬥⬥

Stage 4: Fetch, parse, and chunk the selected pages to prepare for embedding and semantic scoring

This is likely the biggest time bottleneck in the pipeline, since fetching a web page (particularly a large one) can be slow.

Which leads me to the conclusion that this is actually gate number 2. If your page is slow it’s going to get cut. And even if you do return your page in time, the speed at which you do so may dictate just how much of it (or more specifically its content) makes it to stage 5.

Here’s how I think this stage would work.

After stage 3 ChatGPT (or more specifically, the behind the scenes web.run black box that we’re trying to shine some light on) has 10-20 “interesting” candidate pages per basic search. It now needs to fetch them.

Fetching is I/O-bound (waiting) while processing (parsing HTML into markdown, chunking) is CPU-bound (but incredibly fast). So it’s likely that a single CPU thread requests all pages at once with a hard timeout, then immediately moves on to processing thousands of other users’ requests while waiting for your data to arrive.

I’d speculate that this timeout might be in the region of 2 seconds. Which means if your server responds quickly, there’s more time to download your page. If your server is slow (TTFB over 1s) there’s less time to download your content (it’s likely to be truncated).

*Cough* Core Web Vitals.

Gemini 3.0 Pro explains it well:

The Dispatcher: One single CPU thread fires off 50 requests. It takes milliseconds. It doesn't "wait." It moves on to the next user's query.
The Event Loop: The server maintains 50 open "sockets" (lightweight connections). This costs almost zero CPU, just a tiny bit of RAM.
The Race:
Site A returns at 200ms: The Dispatcher gets a "ping," wakes up, grabs the HTML, strips/chunks it (CPU spike for 10ms), and throws the chunks into the "GPU Waiting Room."
Site B returns at 1.9s: The Dispatcher wakes up, grabs the HTML...
Site C is still loading at 2.0s: The Global Timer fires. The Dispatcher cuts the connection. Site C is dead.
The GPU Batch: Once the 2.0s timer expires (or enough pages have arrived), the system takes everything in the "GPU Waiting Room," batches it, and sends it to the Embedding Model.
— Gemini 3.0 Pro

Long story short for fetching/scraping: all pages are fetched at once. You have a couple of seconds (probably) for your server to respond and stream the HTML response. OpenAI can’t afford to wait for your potato server to respond and send the content, as all latency adds up and users are impatient, no matter how pretty the “Searching the web…” spinner is.
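A minimal asyncio sketch of that dispatcher-with-a-global-timer pattern. The fetches are simulated with sleeps, and the 2-second budget is this article's speculation, not a known OpenAI value:

```python
import asyncio

HARD_TIMEOUT = 2.0  # speculative global budget

async def fetch(url, latency):
    # Stand-in for a real HTTP fetch; latency simulates server response time.
    await asyncio.sleep(latency)
    return url, f"<html>content of {url}</html>"

async def fetch_all(sites):
    tasks = [asyncio.create_task(fetch(u, s)) for u, s in sites.items()]
    # One global timer for the whole batch: whatever hasn't arrived is cut.
    done, pending = await asyncio.wait(tasks, timeout=HARD_TIMEOUT)
    for task in pending:
        task.cancel()                      # Site C is dead
    return dict(task.result() for task in done)

pages = asyncio.run(fetch_all({
    "site-a.example": 0.2,   # fast: makes the batch comfortably
    "site-b.example": 1.9,   # just under the wire
    "site-c.example": 2.5,   # too slow: dropped
}))
print(sorted(pages))         # site-c.example never makes it
```

Note that the dispatcher never waits on any individual site; the only clock that matters is the global one.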

For chunking, all pages that responded in time are parsed and split (128 token chunks - with evidence suggesting the splitting is relatively crude), with whatever HTML they returned in time being included. If that was only the <head> section… unlucky. There may also be a hard limit on chunks/tokens (more on that later) per page.
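A splitter consistent with what the junk chunks suggest: fixed-size windows over tokens, with no respect for sentences or page structure. The 128-token size is from the observed snippets; the whitespace "tokenizer" and the per-page cap are my assumptions:

```python
# Deliberately crude chunking, matching the "relatively crude" splitting
# the junk chunks imply. CHUNK_SIZE is observed; the cap is hypothetical.

CHUNK_SIZE = 128          # tokens per chunk
MAX_CHUNKS_PER_PAGE = 50  # assumed hard limit per page

def chunk_page(text):
    tokens = text.split()  # real systems use subword tokenizers; this is a proxy
    chunks = [
        " ".join(tokens[i:i + CHUNK_SIZE])
        for i in range(0, len(tokens), CHUNK_SIZE)
    ]
    return chunks[:MAX_CHUNKS_PER_PAGE]

chunks = chunk_page("word " * 300)
print(len(chunks), len(chunks[0].split()))  # 3 chunks; the first has 128 tokens
```

Nothing here knows what a heading, a sentence, or a nav bar is, which is exactly how "Skip to content" ends up inside a scoring chunk.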

All the chunks are then passed to the GPU queue for embedding.

  • Expected timing/latency: 2 seconds to retrieve the pages (hard timeout), 200-300ms to parse and split them.
  • Total time elapsed: 4.3 seconds
  • Total candidate pages for each search after this stage: still 10-20
⬥⬥⬥

Stage 5: Batch embed/score all the chunks

At this stage, one of Sam’s hundreds of thousands of H100s takes a break from generating Sora2 videos of Mickey Mouse fighting Darth Vader and enters the arena.

It grabs the chunks from all the pages, vectorizes them, compares each against the semantic query, and gives them a score. All this happens in less than the time it takes OpenAI to burn another million dollars, which is the shortest unit of time known to man.
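Mechanically, the whole scoring pass is one batched embedding call plus one matrix multiply. Fake random unit vectors stand in for the real embedding model here; the shapes, not the numbers, are the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    # Stand-in for the real embedding model: deterministic random vectors,
    # unit-normalised the way real embeddings typically are.
    out = rng.normal(size=(len(texts), 8))
    return out / np.linalg.norm(out, axis=1, keepdims=True)

chunks = [f"chunk {i}" for i in range(40)]   # every chunk from every page
chunk_vecs = embed(chunks)                   # one batched GPU call
query_vec = embed(["running shoes 2025 top best list awards"])[0]

# With unit vectors, a single matrix multiply IS the cosine-similarity pass.
scores = chunk_vecs @ query_vec
best = int(np.argmax(scores))
```

Forty chunks or forty thousand, it's the same single matmul, which is why this stage barely registers on the clock.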

On to stage 6.

  • Expected timing/latency: 100-200ms (network hops, actual calculations instant)
  • Total time elapsed: 4.5 seconds
  • Total candidate pages for each search after this stage: still 10-20
⬥⬥⬥

Stage 6: Final selection of pages for “Deep Reading”

So now we have scored chunks, what do we do with them?

Back to Thinky for final selection.

For each candidate page (remember we’re down to 10-20) it will receive:

  • The initial user query + the basic search query
  • The SERP information (title, description etc) that our “Interesting Pages” model got
  • A single 128 token snippet of content from the page

This single snippet of content is the top scoring chunk (cosine similarity match compared to the semantic query, which if you recall is intent weighted) for each page, which was calculated in stage 5.
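That "one audition snippet per page" step is just a per-page argmax over the stage 5 scores, which is also why a junk chunk can win the audition if it happens to score well. The data below is invented for illustration:

```python
# Per-page argmax over (page_url, chunk_text, cosine_score) triples.
# Each page auditions for Thinky with its single best-scoring chunk.

def audition_snippets(scored_chunks):
    best = {}
    for url, chunk, score in scored_chunks:
        if url not in best or score > best[url][1]:
            best[url] = (chunk, score)
    return {url: chunk for url, (chunk, _) in best.items()}

snippets = audition_snippets([
    ("a.example", "Skip to content ...", 0.61),  # junk can still score well
    ("a.example", "our review of ...",   0.58),
    ("b.example", "top shoes of 2025",   0.74),
])
print(snippets["a.example"])  # the junk chunk wins a.example's audition
```

There's no editorial judgement in the selection itself, just whichever chunk the maths says sits closest to the semantic query.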

Using this enriched page-level context Thinky will select the 3-5 pages that make it through to final selection/”deep reading”.

Pages that failed the cut here will still appear as sources in the UI (often displaying their rejected 'junk chunk' as a badge of shame).

  • Expected timing/latency: 200-300ms (network hops, actual calculations instant)
  • Total time elapsed: 4.8 seconds
  • Total candidate pages for each search after this stage: 3-5
⬥⬥⬥

Stage 7: Mix-In + final generation

So for a single web search does the final, expensive model just receive 3-5 chunks for context (i.e. one chunk from each page)?

No. Well, almost certainly no. For each of our winning pages, it likely receives either:

  1. A sliding window of context (the top scoring chunk for the page + the previous/next chunks).
  2. All chunks for winning pages (possible, but unlikely)
  3. All chunks up to a max token limit (also possible, but unlikely)
  4. A diverse selection of chunks that have been re-ranked with other semantic variations (quite possible)
  5. The single chunk (highly unlikely)
  6. A synthesized summary from a smaller model (unlikely, but possible)
  7. Possibly a more structured list of headings, or tables if they exist
Note: I can’t see how much context the model receives for each cited page, which is why this part is speculative. But I can see direct evidence that it receives more than the single chunk (hence “highly unlikely”).

The sliding window feels the most probable to me, which for each page would be around 300 words of context. Possible that it (GPT 5.2) can also request more iteratively.

But that gives us a baseline of 384 tokens per page, and somewhere in the region of 5-6K tokens of context in total (depending on the number of initial web searches/finally selected pages).
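If the sliding-window guess is right, the expansion is trivial: the top-scoring chunk plus its neighbours. A radius of one chunk gives the 3 x 128 = 384-token baseline; the helper itself is hypothetical:

```python
# Hypothetical sliding-window expansion for a "winning" page: the best
# chunk plus its immediate neighbours, clamped at the page boundaries.

def sliding_window(chunks, best_idx, radius=1):
    lo = max(0, best_idx - radius)
    hi = min(len(chunks), best_idx + radius + 1)
    return " ".join(chunks[lo:hi])  # ~3 x 128 = 384 tokens of context

chunks = [f"chunk-{i}" for i in range(10)]
print(sliding_window(chunks, 4))  # chunk-3 chunk-4 chunk-5
print(sliding_window(chunks, 0))  # edge case: chunk-0 chunk-1
```

Cheap to compute, and it restores the local coherence that crude fixed-size splitting destroyed in the first place.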

Although it likely receives a little more…

Because at this stage (or as the result of different behind-the-scenes processes that we’re not focused on today) we get a load of additional sources mixed in.

One of these: a clear indicator of a “VIP lane” / local index.

Some high authority publications (think Time, Business Insider, Forbes) have clearly pre-synthesized, full page-level summaries. These appear to get dumped into the context (likely a semantic search against a local index).

Sidenote: This confirms the existence of a 'Fork in the Road.' While your site was fighting for its life in the live fetch gauntlet, Time.com was chilling in the VIP lane, its pre-summarized content ready to be served instantly.

We get some Reddit (likely via API).

We get shopping, maps etc (depending on the query).

We get news sources (often with pre-synthesized summaries).

I’ll explain a bit more about how we know this when we get to the evidence. But, it’s likely that final generation, depending on the query, will have a lot of this for context. Probably Thinky is orchestrating most of this, or it may just be crude programmatic retrieval from cached vector indexes (the intent weighted semantic searches).

Either way, GPT 5.2 will generate the final answer, pretend it did the lot (unaware that it didn’t, as we’ll see), and our background web retrieval probably added around 5 seconds of idealized total latency. In the real world, though, network congestion, task queues, and timeouts can stretch this to 30+ seconds, hence the inconsistencies in how long ChatGPT takes to answer the same questions.

With a bit of luck, that answer may even be accurate…

Now let’s move on to the evidence that points us here (after a quick recap of the full pipeline).

  1. Tiny Classifier model attempts to classify the user's query and decide whether to search
  2. Thinky decides what to search (simple and semantic)
  3. Retrieve search results and filter based on metadata (Thinky)
  4. Fetch, parse, and chunk the selected pages to prepare for embedding and semantic scoring
  5. Batch embed/score all the chunks
  6. Final selection of pages for “Deep Reading” (Thinky)
  7. Mix-In (Reddit, news, other internal indexes) + final generation
⬥⬥⬥

Breaking Down The Evidence, Logic, and Reasoning

You’ll perhaps have noted that I’ve sometimes used hedging language throughout the above.

Of course I can’t be sure of the exact behind-the-scenes processes, and the pipeline may differ slightly.

But what I am confident in is that my thesis is valid, and that there’s plenty of evidence to suggest it’s close to what’s happening. Perhaps even that, if it’s not, it should be.

Where there’s no direct evidence, we can use logic and reasoning. The old school, human form of reasoning - not UI theatre.

Let’s break it down piece-by-piece (or chunk by chunk if you prefer…)

1. A Single Frontier Model As The Sole Orchestrator Would Be Stupidity On A Galactic Scale

  • Applies to: the fact we have multiple models at various points of the chain, distilled context for final generation
  • Evidence: reasoning, logic, logs, tangential evidence

We all know that OpenAI doesn’t care about wasting VC $$$. Perhaps they even prefer it.

But even for OpenAI, flooding your frontier model with hundreds of thousands of tokens of context for every single user query would take wastefulness to a whole new level. If Sam A discovered his engineers were doing this, we’d probably be looking at another code red and a capitalized tweet.

Because again, inference is the expensive part. And every token comes at a cost. Fast, cheap, fine-tuned models (Thinky in this case) are used for filtering pages, programmatic maths (cosine similarity) is used for pulling the most relevant content from those pages.

Here’s Thinky in the logs.

Alpha Sonic Thinky V1
Thinky doing his thing

The tangential evidence we have: if you’re an AI Studio (or Gemini API) user, you may have noticed that whenever Gemini searches the web, or uses URL context, we never see tens of thousands of input tokens. Why? Because Gemini 3.0 Pro itself isn’t actually searching the web or reading the full pages. The same rules apply - likely smart orchestration of smaller models (with Google having the additional advantage of its search index).

Note: it's possible that GPT 5.2 may be involved in this part of the chain (communicating with the smaller model), but it's more likely that Thinky is handling all this itself. Certainly that would be the most optimal approach. The junk in the snippets also suggests GPT 5.2 is nowhere near this.

2. The Super Long-Tail Queries ARE For Semantic Scoring And/Or Search

Evidence: logic + junk chunks in the logs

As I covered in my previous X thread, there are two explanations for ChatGPT’s overly verbose, word vomit search queries: either they suddenly got really (like really, really) bad at searching the web, or they are for semantic search/scoring.

While I’d like to say it’s the former to unleash full unbridled Scottish sarcasm, it’s definitely the latter, particularly since the intent weighted queries work so well to match relevant content.

Newsflash: OpenAI is good at semantic search. In other news, water is wet, Scotland gets a lot of rain, and we’re also going to the World Cup in America (one of these is not like the others).

We can also see random junk in the audition snippets in the network logs (below), which proves there’s no intelligence here - it’s mathematical selection from crude text splitting.

3. Evidence For Audition Chunks and Semantic Scoring

Evidence: description field with snippet from content in the JSON (see also evidence 4 for the second part of this)

When a page makes it through to consideration for final selection, but doesn’t get cited, in the UI we often see it as a “Source”.

Behind the curtain, we can see its candidate chunk in the description field, often containing junk text like “Skip to content”, random fragments of navigation etc, but importantly, sometimes clearly from the middle of content (or even further down the page), proving that it’s a “score then find best semantic match” for representative chunk selection.

That junk in the snippet makes it all the way to the front end UI.

junk in description field
junk in the description field
The chunk above is likely scoring well semantically since it has key related terms in it, but is unappealing to Thinky (cited:false) due to the navigational/UX fragments.

4. Evidence For Additional Context Retrieval For “Winning” Pages

When a page is cited (cited:true), the description field is empty. The logical explanation: there’s a state change and the candidate chunk isn’t needed anymore, since the page made the cut and qualified for a deep read, with the additional context (sliding window or one of the other expansion methods proposed earlier) used in final synthesis.

Another explanation (that I can’t ignore) is that it’s for front-end UI only, but token efficiency and other fingerprints suggest not.

But there’s an exception to this rule…

5. Evidence For Local, “VIP” Index

High authority domains (think Business Insider, Forbes) and news sites (New York Times, NY Post etc) often have pre-synthesized, AI generated summaries of intent aligned pages. The snippet field (this is an earlier part of the chain) text does not appear on the page itself verbatim, it’s a distilled summary.

pre-synthesized snippet
pre-synthesized snippet

This could be generated on-the-fly for a select list of domains. But it’s more probable that this is a small local index/cache (or comes directly via API from the organization).

It’s quite possible that our Thinky model sees the full summary for the “VIP” pages, giving them an edge in selection, or indeed that they skip this stage and go straight to the “mix-in” (below).

For final generation, the big model (GPT 5.2 currently) might receive these summaries (which are generally information dense), or perhaps semantically matched chunks. There’s no way to infer this, so for now, we’ll assume it gets the summaries.

6. Evidence For Mix-In

We can see various ref_types in the logs (i.e. “news”, "reddit"), and a clear indication that there are a number of different retrieval sources.

For example, these Reddit results likely come directly via API:

ref type reddit
ref_type Reddit
⬥⬥⬥

Addressing The Leaked System Prompts

But… but… we have leaked system prompts showing us that GPT 5.2 can search the web.

Oh sweet summer child, that’s all part of the theatre.

But in this case, that theatre is not for you. It’s for GPT 5.2 itself.

OpenAI does not want the following to happen:

  • User: “how did you find this information?”
  • ChatGPT: “Well we entered a complex multi-stage pipeline with a classifier model, followed by…”

Instead it wants:

  • ChatGPT: “I used my web search capabilities to…”

The best description I’ve ever heard for an LLM is an idiot savant with amnesia. LLMs are stateless, continuous conversation is a clever illusion. If you tell an LLM it did something it has no way to know whether that’s true or not, but it may check its system prompt to confirm it has that capability (even though it actually doesn’t).

For final generation, GPT 5.2 gets all the context collected by Thinky and chums and is told that it was the one that collected it all.

Quick check of the system prompt, seems plausible, generate the answer.

The illusion is complete.

⬥⬥⬥

Addressing The "Thinking" Text

We of course don't know with 100% certainty when the hand-off to the final model is. But it's likely right at the end.

That thinking text you see? Most of it is probably coming from Thinky running its context collection tasks (the clue is in the name).

We can even see the hard constraints the model has from time-to-time.

thinky thinking
Thinky, thinking...
AGI soon.

Addressing The GEO Spam

This 'Amnesiac Genius' design is also why the system is so vulnerable.

Because GPT 5.2 didn't actually do the research, it implicitly trusts the 'briefing binder' it was handed. If Thinky hands it a spammy press release, GPT 5.2 assumes it's valid because the prompt tells it: 'You found this.' It trusts itself, even when 'itself' is actually a crude scraping algorithm.

A couple of days ago, Glen Allsopp (legend) posted this chart on X.

Spam in top-cited domains

If you’ve read this far, the answer to why ChatGPT is more susceptible to spam should be obvious, but I’ll spell it out.

Google has a 28-year-old immune system. It understands off-page reputation. It knows that just because Yahoo Finance is authoritative doesn't mean every page on it is. It checks: 'Who links to this specific page? What do others say about it?' (all pre-computed of course)

ChatGPT's Thinky just checks the ID at the door. 'Is this domain on the list? Yes? Come on in.' It has no mechanism to distinguish between a Pulitzer Prize-winning article and a paid press release on the same domain.

Note: I’ve also been deep diving into Google’s AI Mode and how it chooses which pages to cite. There are similarities, but clear differences. It needs its own deep dive, so that’s a post for another day.

They (OpenAI) might get better at fighting spam (if they care), but they have a HUGE disadvantage here.

One that seems to me to be insurmountable, and another reason why (in my opinion) Google is likely to win the AI battle and charts like this are ridiculous.

ridiculous chart showing google traffic decline
Chart Crime reaches a new low

We’re in gen AI’s wild west period. No point denying it: spam works (at least for ChatGPT). But if you’re not just looking for a short-term hit, and you care about your search rankings and brand reputation, you should think very carefully before following the current zeitgeist of self-aggrandising press releases and “best X for Y” networks.

Because Google slaps are real, and they’re probably coming soon.

(I propose calling it the LLMings update)

Addressing Training Data (And The Lottery of Final Generation)

While ChatGPT is relying more and more on RAG (likely with strict instructions to prioritize retrieved context), the frontier model (GPT 5.2) does, of course, have mountains of knowledge from training.

The final generation is a synthesis. It is a fusion of the real-time context gathered by Thinky + GPT 5.2's frozen memory of the world.

  • The "Nike" Effect: If Thinky somehow fails to retrieve a single page mentioning "Nike" for a query about "best trainers," GPT 5.2 is probably going to mention it anyway. The concept of "Nike" is so deeply embedded in the model's weights, it sits so close to "sneakers" (or trainers for us Scots) in its internal vector space, that the model "knows" it is a leading brand by default.

But for specific or current questions, this training data is a fallback, not a primary source. And you can't rely on it.

The Lottery of Generation

This mix of frozen memory and live retrieval introduces a massive variable: Probability.

When GPT 5.2 writes the final answer, it's picking the next most likely token based on a temperature setting that ensures variety. Every answer is a dice roll. That's just how LLMs work.

In one generation, it might lead with the retrieved fact from Source A.

In the next generation, for the exact same query, it might lead with a general knowledge point about Nike.

This is why obsessively tracking the final "Answer" is missing the point. The output is non-deterministic. You're trying to “GEO” or "AEO" a cloud formation.

I’ll concede there is some value in tracking directionality (e.g., "Is the sentiment generally positive?", "are we showing up more"). But trying to reverse-engineer the algorithm based on the specific adjectives GPT 5.2 chose to use on Tuesday vs. Wednesday is a waste of time. There's far too much noise.

The only part of this chain that is deterministic, the only part you can reliably engineer for, is the retrieval. The search bit.

If Thinky doesn't find you, you're praying that the frontier model remembers you from a scrape 12 months ago. Which means, for the best chance of being cited, you need to rank in search. Gamble with your search rankings with GEO spam, gamble with your AI visibility.

The "Personalization" Blind Spot

But there is an even bigger reason why tracking the final answer is pointless: You are not the only variable.

We saw in the logs that the sonic_classification_result includes a num_messages history and likely other user-specific context vectors. The "Planner" (Thinky) isn't generating queries in a vacuum. It's generating queries for a specific user.

If I search for "best CRM" after spending 20 minutes asking about "enterprise security," Thinky will likely generate searches weighted towards "security" and "compliance."

If you search for "best CRM" after asking about "cheap email marketing," Thinky will generate searches weighted towards "price" and "simplicity."

The Result: We get two completely different sets of retrieval candidates, two different sets of "Audition Chunks," and two different final answers, even though the "prompt" that led to that answer was the same.
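You can sketch the effect in a few lines. This is purely illustrative: the function name and the "append recent topics" heuristic are my invention, not OpenAI's planner logic. The point is only that the same prompt plus different session context yields different query sets, and therefore different retrieval candidates:

```python
def plan_queries(prompt, recent_topics):
    """Toy stand-in for the Planner: fan one prompt out into
    search queries, biased by the user's recent conversation."""
    base = [prompt]
    expansions = [f"{prompt} {topic}" for topic in recent_topics]
    return base + expansions

# Same prompt, two different session histories.
user_a = plan_queries("best CRM", ["enterprise security", "compliance"])
user_b = plan_queries("best CRM", ["cheap email marketing", "pricing"])

print(user_a)  # ['best CRM', 'best CRM enterprise security', 'best CRM compliance']
print(user_b)  # ['best CRM', 'best CRM cheap email marketing', 'best CRM pricing']
```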

The "GEO" tools that scrape ChatGPT using clean, context-free accounts are showing you a "Generic Default" reality that effectively doesn't exist for real users. They're tracking a hallucination of consistency. The ones that steal users' actual conversations (naming no names)... well, that's a different story.

Real users have history. Real users have context. And because Thinky is context-aware, every user is effectively searching a slightly different version of the internet. You can't "rank #1" for a query that changes every time it is asked.

Don't Believe Me? Try It Yourself.

This isn't just theory. You can see this randomness in action right now.

Taking a Twitter joke to its full conclusion, I built DaveGPT, a working clone of the retrieval pipeline we've just reverse-engineered. It uses the exact same mechanics: a Planner to generate queries, a scraper to fetch the live web, and a Vector Judge to select the winners.

DaveGPT
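In skeleton form, the pipeline looks something like this. All four stage functions here are my simplified stand-ins, not DaveGPT's actual code; the structural point is that only the last stage rolls dice:

```python
import random

def planner(prompt):
    # Stage 1: fan the prompt out into search queries (hypothetical).
    return [prompt, f"{prompt} reviews", f"{prompt} 2025"]

def scraper(queries, cache):
    # Stage 2: fetch pages -- served from a cache here, as in DaveGPT.
    return [cache[q] for q in queries if q in cache]

def vector_judge(chunks, k=2):
    # Stage 3: keep the top-k chunks. Deterministic -- in the real
    # system this would be a cosine-similarity ranking.
    return chunks[:k]

def synthesizer(chunks, rng):
    # Stage 4: the only probabilistic step -- same evidence, varying prose.
    openers = ["In short,", "Overall,", "Based on recent sources,"]
    return f"{rng.choice(openers)} " + " ".join(chunks)

cache = {"best trainers": "Nike leads most lists.",
         "best trainers reviews": "Adidas scores well on comfort."}

evidence = vector_judge(scraper(planner("best trainers"), cache))
print(synthesizer(evidence, random.Random()))  # wording drifts run to run
```

Run it a few times: stages 1-3 hand the synthesizer identical evidence every time, and the output still drifts.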

But here is the test: Run the same query twice (there are only 2 for now).

Even with the exact same retrieval sources (because I've cached the scrape to save your patience), the final answer will drift. The adjectives change. The order of the list swaps. The emphasis shifts.

Why? Because the final synthesizer (GPT 5.2 in the real world, Gemini in the demo) is probabilistic.

If you can't get a consistent answer from a controlled simulation with cached data, how on earth do you expect to get a consistent "ranking" from a live, context-aware, personalized system serving 200 million users?

You can try DaveGPT here.

So What CAN You Optimize?

We circle all the way back round to…

Keyword research.

You don’t need to generate thousands of spam pages targeting every variation under the sun. As has always been the case, you want pages that target the main keywords, topics, and subtopics in your niche.

And for the “main” keyword of each page on your site, you want one “chunk” that scores the highest cosine similarity against that intent: the chunk that gets presented to Thinky as your audition chunk.

You don’t want the highest scoring chunk to be a random footer block (evidence from the logs shows random junk in the snippets).

And you also want to make sure it’s high up on the page, in case your content gets truncated during scraping.
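You can sanity-check this yourself with a crude approximation. The bag-of-words "embedding" below is a deliberately primitive stand-in for a real embedding model, and the chunks are invented, but it captures the failure mode: if your footer outscores your intro for the page's main intent, that's the chunk auditioning on your behalf.

```python
import math
from collections import Counter

def embed(text):
    """Crude bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity over word-count vectors."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

intent = embed("best running shoes for beginners")
chunks = [
    "Our guide to the best running shoes for beginners in 2025",
    "Subscribe to our newsletter for weekly deals",
    "Copyright 2025 Example Ltd. All rights reserved",
]

ranked = sorted(chunks, key=lambda c: cosine(intent, embed(c)), reverse=True)
print(ranked[0])  # the chunk that should audition -- not the footer
```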

Newsflash: we’ve basically been doing that for years to optimize for rich snippets/position zero.

Beyond that, you want to write clear, well structured content with logical headings and sections. If you haven’t been doing that… well what have you been doing?

See: The GEO grift.

But there’s no doubt there’s value in double checking which chunk is going to be selected for a page’s main intent, so I (of course) built a tool for that. It’s part of Verify, a brand new set of tools within QueryBurst, an app within an app, that are all about ensuring your site is optimized for RAG.

I’ll be writing more about Verify soon (you can find out a bit more about it on our homepage), or follow me on X for updates, but it’s live in Beta for Pro Subscribers (currently $59 p/m, which includes all our GSC focused reports) and there’s nothing else like it on the market.

In the video below I walk through a powerful workflow, which combines two of the Verify tools.

Verify is in Beta, so don’t expect perfection, but do expect insights that are actually actionable, rather than tracking hundreds of thousands of prompts and scratching your head over variations in probabilistic answers.

And yes, it’s an SEO tool.

Because ChatGPT, as I’ve just spent 6,000 words explaining, is mainly a search engine.

Ergo, optimizing for it is SEO.

(sorry GEO bros)

A Note on Methodology:
This analysis is based on a forensic examination of public-facing network logs, client-side configuration files, and behavioral stress-testing of the ChatGPT interface as of December 2025. While OpenAI does not publicly document this architecture, the artifacts presented here provide a definitive mechanical blueprint of the system in its current state. Internal codenames and specific thresholds are, of course, subject to change at any time.

David McSweeney

QueryBurst Founder & SEO Consultant

David has been involved in SEO since the late 90s, consulting for 15 years, and was previously the blog editor for both Ahrefs and Seobility. He's an AI obsessive and early adopter, and used his 28 years' experience in the industry and deep knowledge of technical SEO to build QueryBurst - the world's first fully integrated AI Virtual SEO Consultant.

