Everything so far builds a model that predicts the next token in one quick pass. Ask it a hard math or logic problem and it often blurts out a confident — but wrong — answer.
A reasoning model is the exact same Transformer, trained to think before it answers.
The key shift: think, then answer
A normal model goes straight to the answer:
Question → Answer
A reasoning model writes a long private train of thought first:
Question → step, step, step, check, backtrack… → Answer
That middle part is a chain of thought — the model "talking to itself" on a scratchpad, working through the problem one step at a time, before committing to a final answer.
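To make the "scratchpad, then answer" shape concrete, here is a minimal sketch of separating the two parts of a generated text. The `Final answer:` marker and the names `ANSWER_MARKER`, `ModelOutput`, and `split_output` are assumptions made up for this illustration; real systems each use their own conventions for marking where the thinking ends.

```python
from typing import NamedTuple

# Hypothetical marker separating the scratchpad from the final answer.
# This convention is invented for illustration only.
ANSWER_MARKER = "Final answer:"


class ModelOutput(NamedTuple):
    reasoning: str  # the chain of thought (the scratchpad)
    answer: str     # the committed final answer


def split_output(text: str) -> ModelOutput:
    """Split generated text into scratchpad and final answer."""
    # rpartition splits on the LAST occurrence of the marker.
    reasoning, marker, answer = text.rpartition(ANSWER_MARKER)
    if not marker:
        # No marker found: treat the whole text as a direct answer.
        return ModelOutput(reasoning="", answer=text.strip())
    return ModelOutput(reasoning=reasoning.strip(), answer=answer.strip())


generated = (
    "17 * 6: 10 * 6 = 60, 7 * 6 = 42.\n"
    "60 + 42 = 102. Check: 102 / 6 = 17, correct.\n"
    "Final answer: 102"
)
out = split_output(generated)
print(out.answer)
```

A text with no marker comes back with an empty `reasoning` field, which corresponds to the plain Question → Answer case.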
Why this makes it smarter
Remember: each token the model generates is one burst of computation. By producing hundreds of reasoning tokens before answering, the model spends far more compute on a hard problem than a single-pass answer ever could.
more thinking tokens → more computation → better answers on hard problems
Letting the model write more steps before answering gives it more chances to work things out, like a student using scratch paper instead of answering instantly.
This is called test-time compute: you make the model smarter by letting it think longer at answer-time, not only by making it bigger at training-time.
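A back-of-the-envelope sketch of how much extra compute the thinking tokens buy. It assumes the common rule of thumb that generating one token costs roughly 2 FLOPs per model parameter, and it ignores attention over the growing context and other overheads. The model size and token counts are hypothetical numbers chosen only for illustration.

```python
def forward_flops(n_params: float, n_tokens: int) -> float:
    """Rough generation cost: about 2 FLOPs per parameter per generated token.

    This is an approximation, not an exact cost model.
    """
    return 2.0 * n_params * n_tokens


N_PARAMS = 7e9          # hypothetical 7-billion-parameter model
ANSWER_TOKENS = 10      # hypothetical length of the final answer
THINKING_TOKENS = 500   # hypothetical length of the scratchpad

direct = forward_flops(N_PARAMS, ANSWER_TOKENS)
reasoned = forward_flops(N_PARAMS, THINKING_TOKENS + ANSWER_TOKENS)
ratio = reasoned / direct

print(f"direct answer:   {direct:.2e} FLOPs")
print(f"with reasoning:  {reasoned:.2e} FLOPs")
print(f"ratio:           {ratio:.0f}x")
```

Under these assumptions the model itself is unchanged; only the number of generated tokens differs, and the compute spent on the problem scales with it.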
How it is trained
Start from a normal instruction-tuned model, then add a reasoning stage:
| Step | What happens |
| --- | --- |
| Show worked examples | Train on answers that include step-by-step working, not just the final result. |
| Let it attempt hard problems | On questions with a checkable answer (math, code), the model generates many full solution attempts. |
| Reward only correct answers | Give a point when the final answer is actually right, and reinforce the whole thinking path that got there. |
| Repeat millions of times | The model gradually discovers which styles of thinking earn points. |
reward = 1 if final answer is correct, else 0
No human grades the individual steps. The model only earns a point when it lands on the right final answer, so it has to figure out good reasoning on its own.
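The reward rule above can be sketched as a small checker. This is an illustration under stated assumptions: the helper names `extract_final_answer` and `outcome_reward` are hypothetical, and "take the last number in the text" is a deliberately crude stand-in for real answer parsing.

```python
import re
from typing import Optional


def extract_final_answer(text: str) -> Optional[str]:
    """Return the last number in the text (a crude, assumed parsing rule)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def outcome_reward(attempt: str, reference: str) -> int:
    """reward = 1 if the final answer is correct, else 0.

    The reasoning steps are never inspected; only the final answer counts.
    """
    predicted = extract_final_answer(attempt)
    if predicted is None:
        return 0
    return 1 if float(predicted) == float(reference) else 0


good = "10 * 6 = 60, 7 * 6 = 42, 60 + 42 = 102. The answer is 102."
bad = "17 * 6 is probably 112."

print(outcome_reward(good, "102"))
print(outcome_reward(bad, "102"))
```

Because the checker only looks at the end result, a sloppy scratchpad with a correct answer scores the same as a careful one, which is the trade-off of grading outcomes rather than steps.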
This is reinforcement learning with verifiable rewards: because math and code answers can be checked automatically, no human has to label the reasoning. The model effectively teaches itself.
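As a toy illustration of how a 0-or-1 outcome reward can steer behavior without anyone labeling the steps, the sketch below runs a basic policy-gradient (REINFORCE) update over two made-up "thinking styles". It is not how any particular reasoning model is trained: the two styles, their success rates, the learning rate, and the step count are all assumptions chosen for the demo, and a real system updates a full language model rather than two numbers.

```python
import math
import random

# Hypothetical success rates: how often each style ends in a correct final answer.
SUCCESS_RATE = {"blurt_answer": 0.2, "work_step_by_step": 0.8}
STYLES = list(SUCCESS_RATE)


def softmax(logits):
    """Turn raw preferences into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0 for _ in STYLES]  # start with no preference
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a style from the current policy.
        choice = rng.choices(range(len(STYLES)), weights=probs)[0]
        # Outcome-only reward: 1 if the attempt happened to be correct, else 0.
        reward = 1 if rng.random() < SUCCESS_RATE[STYLES[choice]] else 0
        # REINFORCE: move logits along reward * gradient of log-probability.
        for k in range(len(STYLES)):
            grad = (1.0 if k == choice else 0.0) - probs[k]
            logits[k] += lr * reward * grad
    return dict(zip(STYLES, softmax(logits)))


final_policy = train()
print(final_policy)
```

The policy is never told which style is better. It only sees which attempts earned a point, and its probability mass shifts toward the style that earns points more often.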
What emerges — by itself
The surprising part: nobody hand-codes these behaviors. With enough practice on checkable problems, the model spontaneously starts to:
- Self-correct — "wait, that step is wrong, let me redo it."
- Break problems down — split a big question into smaller, easier pieces.
- Explore options — try several approaches and keep the one that works.
- Think longer on harder problems — spend more steps when a question is tough.
In the 100-book universe
Instead of instantly guessing the next plot point, the model first writes a scratchpad: "the gun appeared in chapter 1, the character is furious, an earlier promise was broken — so the likely next event is…" — and only then answers. We reward the chains that correctly predict held-out passages, and the model learns to reason about the story instead of pattern-matching.
The trade-off: reasoning is slower and more expensive, because the model generates many hidden "thinking" tokens for every answer. The payoff is often much higher accuracy on math, code, logic, and planning.
Same architecture, new skill: a reasoning model is not a different machine — it is a next-token predictor that has been rewarded for thinking out loud until it gets hard things right.