Prompt Tips · Jan 30, 2026 · 9 min read

How ChatGPT Works (Without the Hand-Wavy Magic)

A practical, engineer-friendly tour of what's under the hood: tokens, transformers, attention, decoding, and why alignment changes what you see.


ChatGPT feels like a chat app. But under the hood it's closer to a very opinionated autocomplete system wrapped in a product.

That framing sounds like I'm dismissing it. I'm not. Autocomplete is exactly the right mental model, because it forces you to ask the right questions: autocomplete over what, with what context, under what constraints, and optimized for which objective?

Once you answer those, most "mysteries" of ChatGPT become predictable. And once it's predictable, you can prompt it like a grown-up.


The core loop: predict the next token

At inference time (when you're chatting), ChatGPT is generating one token at a time. A token is a chunk of text (sometimes a word, sometimes part of a word, sometimes punctuation). Your message gets tokenized, then the model predicts a probability distribution for the next token, picks one (more on how later), appends it to the context, and repeats.
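To make that loop concrete, here's a toy sketch. Everything in it is made up for illustration: the six-word vocabulary, the fixed logit table, the seed. A real model computes logits from the entire context with a Transformer rather than a last-token lookup, but the predict-sample-append cycle is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a fixed table of next-token logits keyed on the last token.
# A real LLM computes logits from the *whole* context; this lookup table
# exists only to show the shape of the generation loop.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]
LOGITS = {
    "the": [0.1, 2.0, 0.1, 0.1, 1.5, 0.1],
    "cat": [0.1, 0.1, 2.0, 0.5, 0.1, 0.1],
    "sat": [0.1, 0.1, 0.1, 2.0, 0.1, 0.1],
    "on":  [2.0, 0.1, 0.1, 0.1, 0.5, 0.1],
    "mat": [0.1, 0.1, 0.1, 0.1, 0.1, 2.0],
    ".":   [2.0, 0.1, 0.1, 0.1, 0.1, 0.1],
}

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def generate(context, n_tokens):
    """Append n_tokens one at a time: predict -> sample -> append -> repeat."""
    tokens = list(context)
    for _ in range(n_tokens):
        probs = softmax(LOGITS[tokens[-1]])    # distribution over VOCAB
        next_tok = rng.choice(VOCAB, p=probs)  # pick one token
        tokens.append(next_tok)                # it becomes part of the context
    return tokens

out = generate(["the"], 4)
```

The important part is the last three lines of the loop: the sampled token is appended and fed back in, which is all "autoregressive" means.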

A GPT-style model is a decoder-only Transformer: it processes a sequence left-to-right with causal masking, meaning each position can only "see" earlier tokens, never future ones. That's what makes it an autoregressive generator rather than a bidirectional encoder like classic BERT-style models. A good reference-style explanation of this exact pipeline (tokenization, embeddings, stacked decoder blocks, causal attention, logits over the vocabulary, then autoregressive decoding) is laid out cleanly in the GPT architecture walkthrough in [1].
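Causal masking is easy to see in a few lines of NumPy. This is a minimal sketch with a toy sequence length and all-zero scores, just to show the mechanics: masked positions get negative infinity before the softmax, so they end up with exactly zero attention weight.

```python
import numpy as np

# Causal mask for a 4-token sequence: position i may attend to
# positions 0..i only. True = allowed, False = masked out.
T = 4
mask = np.tril(np.ones((T, T), dtype=bool))

# Masked entries get -inf before softmax, so exp(-inf) = 0 weight.
# Zero scores everywhere else means the allowed weights come out uniform.
scores = np.zeros((T, T))
scores[~mask] = -np.inf
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
```

Row i of `weights` is a distribution over positions 0..i: the first token can only attend to itself, the last can attend to everything before it. That asymmetry is the whole trick behind left-to-right generation.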

So when you ask, "How does ChatGPT know what to say?" the simplest answer is: it doesn't "know" in a database sense. It computes "what token is most likely next" given the conversation so far, and keeps going.

That's the engine. Everything else (personality, safety refusals, "helpfulness", tools) is either training or product scaffolding around that engine.


The Transformer part: attention + MLPs stacked deep

Inside each Transformer block you have two big pieces.

First, self-attention. This is the mechanism that lets the model weigh earlier tokens differently depending on what it's generating now. In a standard formulation, the model maps the hidden states to queries (Q), keys (K), and values (V); attention weights come from a softmax over the scaled dot products of Q with K, and those weights are then used to mix V. The key point is not the math; it's the behavior: the model can dynamically decide what to "look at" in the context.

Second, a feed-forward network (FFN/MLP) that transforms representations token-by-token.

Then you stack those blocks many times. The model gradually builds richer internal representations of "what's going on" in the prompt.
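Here's a deliberately stripped-down single block in NumPy. It's a sketch, not a faithful implementation: random toy weights, one head, no LayerNorm, and a square MLP instead of the usual 4x expansion. What it does show correctly is the two pieces (causal attention, then a token-wise MLP) and the residual connections around each.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8  # toy sequence length and hidden size

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(h, Wq, Wk, Wv, W1, W2):
    """One decoder block, minus LayerNorm and multi-head splitting."""
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(d)                    # pairwise relevance
    scores[np.triu_indices(len(h), k=1)] = -np.inf   # causal: no future tokens
    h = h + softmax(scores) @ V                      # attention mixes earlier V's
    h = h + np.maximum(h @ W1, 0.0) @ W2             # token-wise ReLU MLP
    return h

h = rng.normal(size=(T, d))
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
out = transformer_block(h, *params)
```

A nice property you can check directly: because of the causal mask, editing a later token's input changes only that position's output, never the earlier ones. Stacking many of these blocks is the "deep" part.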

If you want an intuition that's more concrete than "it attends," I like a memory analogy: attention behaves like a content-addressable retrieval step, where the current context forms cues and the model pulls relevant content from earlier tokens. A 2026 paper makes this mapping explicit (queries as retrieval cues, keys as indices, values as stored content) and tests it via interventions like swapping Q/K/V projections and observing retrieval vs. hallucination behavior [2]. You don't need to buy every claim in that paper to use the intuition: attention is how the model routes information from earlier text into the next-token decision.

This is why prompt structure matters. You are literally shaping the "memory" the model can retrieve from.


Training vs. chatting: why it feels helpful (most of the time)

People often conflate "ChatGPT" with "GPT, the base model." They're not the same thing.

A base GPT model is trained primarily with a next-token objective on large corpora. That teaches language patterns, facts (imperfectly), styles, and a lot of latent capabilities.

But "being a helpful assistant" is a different skill. Products like ChatGPT get there through post-training: instruction tuning, preference optimization, safety policies, refusal behavior, and system-level guardrails.

You can see the behavioral effects in the wild: users notice that the model sometimes changes its willingness to answer certain questions ("it used to give me stock picks, now it refuses"). That's not the Transformer waking up moody. That's product and policy shaping what the assistant is allowed to do, and how it should respond when it can't. Community threads capture that lived experience even if they don't explain the mechanism [3].

Here's what I've noticed: when you understand that ChatGPT's "personality" is an overlay, you stop arguing with it and start giving it better constraints. You ask for alternatives, summaries, assumptions, or safe forms of output instead of trying to brute-force a forbidden answer.


Decoding: how it chooses what to say (and why it can ramble)

When the model has a probability distribution over next tokens, it still needs a rule to pick one.

If it always chose the highest-probability token (greedy decoding), responses would be consistent but often bland and repetitive. So most systems use a sampling strategy (temperature scaling, nucleus/top-p sampling, and so on) to trade determinism for diversity.

This matters for prompting because your prompt can push the model into "high entropy" territory-vague tasks, conflicting constraints, missing definitions-where multiple continuations seem plausible. That's where you see hedging, rambling, or confident-sounding nonsense. It's not "lying"; it's sampling from a messy distribution you asked it to create.


Context window: it's not "memory" (until the product adds memory)

In the pure model, the only "memory" is the tokens it can see in the current context window. If something falls out of that window, it's gone for that response. That's why long chats can drift.
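The truncation logic is worth seeing, because it explains the drift. Here's a toy sketch: a whitespace split stands in for a real tokenizer, and the keep-newest-first policy is just one plausible strategy, not what any particular product does.

```python
# Sketch of why long chats "forget": the application can only send the model
# a fixed token budget, so older turns get dropped. Token counts are faked
# with a whitespace split; real systems use a proper tokenizer.
def fit_to_window(turns, max_tokens):
    """Keep the most recent turns that fit within max_tokens."""
    kept, used = [], 0
    for turn in reversed(turns):      # walk newest-first
        cost = len(turn.split())      # stand-in for a real token count
        if used + cost > max_tokens:
            break                     # everything older is simply gone
        kept.append(turn)
        used += cost
    return list(reversed(kept))

chat = ["user: my name is Ada", "bot: hi Ada", "user: long question " * 5]
window = fit_to_window(chat, max_tokens=20)
```

In this toy run the oldest turn (where the user stated their name) falls out of the window first, which is exactly the "it forgot my name" failure mode.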

ChatGPT as a product can add longer-term features (like "memory") that store user preferences or facts outside the immediate context. OpenAI's product posts frequently talk about "longer memory" at the UX level [4]. That doesn't mean the core Transformer suddenly learned to remember yesterday. It means the application is retrieving stored data and injecting it back into the prompt or system context so the model can condition on it.

As a prompt engineer, you should treat "memory" as upstream data injection. Helpful when it works, dangerous when it's wrong, and always worth verifying for critical tasks.
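"Upstream data injection" can be sketched in a few lines. Everything here is hypothetical: the store, the keys, and the template are invented for illustration, and nothing reflects OpenAI's actual implementation. The point is only that the model conditions on injected text like any other prompt text.

```python
# Hypothetical memory store and prompt template. A real product decides
# what to store, when to retrieve it, and how to format it; the model
# itself just sees the assembled text.
MEMORY_STORE = {"name": "Ada", "language": "Python", "tone": "concise"}

def build_prompt(user_message, memory):
    """Prepend stored user facts to the prompt as ordinary context."""
    facts = "\n".join(f"- {k}: {v}" for k, v in sorted(memory.items()))
    return (
        "System: You are a helpful assistant.\n"
        f"Known user facts (verify if critical):\n{facts}\n\n"
        f"User: {user_message}"
    )

prompt = build_prompt("Review my script?", MEMORY_STORE)
```

If the store holds a stale or wrong fact, the model will condition on it just as confidently, which is why the "verify for critical tasks" advice above matters.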


Practical prompts that match how it works

The best prompts don't "convince" the model. They shape the context so the most probable next tokens are the ones you actually want.

Try these patterns:

You are helping me draft a technical decision record.

Context:
- System: <brief system description>
- Constraints: <latency, cost, compliance>
- Options: <A, B, C>

Task:
1) List key tradeoffs (with assumptions).
2) Recommend one option and justify it.
3) Add a "what could go wrong" section.

Output format:
Use headings and short paragraphs. No bullet lists.

This works because it creates strong retrieval cues (headings, constraints, explicit tasks) and reduces ambiguity about what token sequences are "correct" next.

If you want to reduce hallucination risk, do this:

Answer using only information stated in the Context section.
If something is missing, ask me 3 clarification questions instead of guessing.

That's not magic either. You're steering the model toward a continuation where "asking questions" is the most probable behavior under uncertainty.


Closing thought

ChatGPT isn't a person in a box. It's a next-token machine with attention-based retrieval and a big pile of alignment and product rules on top. Once you accept that, your prompting gets calmer and more effective. You stop trying to "get lucky" and start designing the context.


References

  1. Fariba Afrin Irany, "From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes" (decoder-only Transformer mechanics, causal attention, token-by-token generation), arXiv cs.CL.
     https://arxiv.org/abs/2601.21955

  2. Viet Hung Dinh et al., "Memory Retrieval in Transformers: Insights from The Encoding Specificity Principle" (Q/K/V roles, attention-as-retrieval framing, interventions), arXiv cs.LG.
     https://arxiv.org/abs/2601.20282

  3. "What is going on with ChatGPT?", r/ChatGPTPromptGenius (user-observed policy/behavior shifts in practice).
     https://www.reddit.com/r/ChatGPTPromptGenius/comments/1qo7mpn/what_is_going_on_with_chatgpt/

  4. "Introducing ChatGPT Go, now available worldwide", OpenAI Blog (product-level "memory" and packaging context).
     https://openai.com/index/introducing-chatgpt-go

Further reading

  5. Marco Bornstein & Amrit Singh Bedi, "AI Cap-and-Trade: Efficiency Incentives for Accessibility and Sustainability" (scale and inference-cost framing; mentions ChatGPT usage scale), arXiv.
     http://arxiv.org/abs/2601.19886v1

  6. "How many of you use ChatGPT every day and what do you actually use it for?", r/ChatGPT (real-world usage patterns, useful for grounding examples).
     https://www.reddit.com/r/ChatGPT/comments/1qq97t2/how_many_of_you_use_chatgpt_every_day_and_what_do/

Ilia Ilinskii

Founder of Rephrase-it. Building tools to help humans communicate with AI.