How LLMs Are Trained
LLM training from scratch · Slide 0 of 37

0. The whole pipeline

Core idea: an LLM is trained by hiding the next token, asking the model to guess it, measuring how wrong it was, and slightly changing its internal numbers so it guesses better next time.

In this toy universe, there are only 100 books ever written. The model's entire world knowledge comes from those books.

  1. Collect the books
  2. Clean the text
  3. Split text into tokens
  4. Turn tokens into numbers
  5. Create training examples
  6. Build a Transformer model
  7. Make the model predict the next token
  8. Measure error with loss
  9. Update weights with backpropagation + optimizer
  10. Repeat many times
  11. Evaluate
  12. Optionally instruction-tune
  13. Optionally align with human preferences
  14. Use it by generating one token at a time
Modern default: most general-purpose LLMs are decoder-only Transformer-family models.

1. Define the toy universe

Assume the universe has only 100 books.

B = {b₁, b₂, ..., b₁₀₀}

B is the complete set of books in this world: book 1, book 2, all the way to book 100. The curly braces just mean "a collection of".
  • B = all books in the universe
  • b₁ = book 1
  • b₂ = book 2
  • b₁₀₀ = book 100

Each book is just text.

Book 1: The cat sat on the mat.
Book 2: The king ruled the city.
Book 3: A river flows through the valley.
...

The model will never see anything outside these 100 books. If none of the books mention "airplanes," the model cannot truly learn airplanes from data. It might invent something by pattern, but it would not be grounded in this universe.

Remember: in this universe, "knowledge" means patterns learned from the 100 books.

2. The actual training goal

The simplest LLM training goal is:

Given previous tokens, predict the next token.

This is called causal language modeling or next-token prediction.

Sentence:

The cat sat on the mat.

Possible tokens:

["The", "cat", "sat", "on", "the", "mat", "."]
Input | Correct next token
The | cat
The cat | sat
The cat sat | on
The cat sat on | the
The cat sat on the | mat
The cat sat on the mat | .

P(next token | previous tokens)

The probability of the next token, given the tokens that came before it. The vertical bar "|" reads as "given".

The model learns a probability distribution over possible next tokens.

Token | Probability after "The cat sat on the"
mat | 0.72
chair | 0.08
floor | 0.05
king | 0.01
river | 0.01

3. Notation

Let the full token sequence from all 100 books be:

x₁, x₂, x₃, ..., x_T

All the text laid out as one long numbered list of tokens: the 1st, 2nd, 3rd, up to the T-th (the very last one).
  • xᵢ = the token at position i
  • T = total number of tokens in all 100 books
  • Example: x₁ = "The", x₂ = "cat"

The model has parameters:

θ

Theta: a single symbol standing in for every adjustable number inside the model.
  • θ = all trainable numbers inside the neural network
  • These include embedding weights, attention weights, feedforward weights, normalization weights, output weights, etc.

The training objective is:

maximize over θ: Σᵢ log P_θ(xᵢ | x₁, x₂, ..., xᵢ₋₁)

Tune the weights θ so the model gives each real next token the highest possible probability, with the log-probabilities added up over every position in every book. (Σ means "add up over all positions".)

Plain English:

Choose model parameters θ that make the real next token as likely as possible across all token positions in all books.

Equivalent minimization form:

minimize over θ: −Σᵢ log P_θ(xᵢ | x_<i)

Here x_<i is shorthand for x₁, x₂, ..., xᵢ₋₁, the tokens before position i. This is the same goal, flipped upside down: make the total "surprise" of the correct tokens as small as possible. Maximizing probability and minimizing negative-log-probability are two ways of saying the same thing.

That negative quantity is the loss.


4. Step 1 — collect and clean the books

You start with 100 raw books. Raw data may include page numbers, duplicate sections, OCR errors, strange formatting, or broken characters.

Chapter 1
THE CAT SAT ON THE MAT

Page 3

The cat sat on the mat.

Cleaning may include:

  • Remove duplicate pages
  • Fix broken encoding
  • Normalize weird quotation marks
  • Remove page numbers
  • Remove OCR errors
  • Preserve paragraph breaks
  • Keep chapter titles if useful
  • Remove corrupted text

After cleaning:

<book_1>
The cat sat on the mat. The dog slept near the fire.
</book_1>

<book_2>
The king ruled the city. The queen studied the stars.
</book_2>
Remember: data quality matters. A clean small dataset can beat a bigger messy dataset for some uses.
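
A few of the cleaning steps above can be sketched in code. This is a minimal illustration, not a standard recipe: the function name clean_book and the specific rules (drop standalone "Page N" lines, straighten curly quotes, collapse extra blank lines) are assumptions chosen to match the raw sample on this slide.

```python
import re

def clean_book(raw: str) -> str:
    """Toy cleaner; the rules are illustrative assumptions."""
    # Normalize curly quotation marks to straight ones.
    text = raw.replace("\u201c", '"').replace("\u201d", '"')
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"Page \d+", stripped):  # drop standalone page numbers
            continue
        kept.append(stripped)
    text = "\n".join(kept)
    # Collapse runs of blank lines but preserve paragraph breaks.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

raw = "Chapter 1\nTHE CAT SAT ON THE MAT\n\nPage 3\n\nThe cat sat on the mat."
cleaned = clean_book(raw)
print(cleaned)
```

Real pipelines add deduplication and encoding repair, which need more than a few regular expressions.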

5. Step 2 — split the books into train / validation / test

We should not train and evaluate only on the exact same text.

Split | Books | Purpose
Training | 90 books | Used to update model weights
Validation | 5 books | Used during training to check overfitting
Test | 5 books | Used only at the end

Why?

If the model memorizes the 90 training books perfectly, we still need to know whether it can predict text from unseen books in the same universe.

Metric | Meaning
Training performance | How well it predicts books it studied
Validation/test performance | How well it predicts books it did not train on
Overfitting signal: training loss goes down, but validation loss gets worse.
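
The 90 / 5 / 5 split can be sketched as follows. The function name split_books and the fixed seed are assumptions for illustration. The important property is that the split happens at the book level, so no book appears in more than one split.

```python
import random

def split_books(book_ids, n_val=5, n_test=5, seed=0):
    """Shuffle once with a fixed seed, then carve off validation and test books."""
    ids = list(book_ids)
    random.Random(seed).shuffle(ids)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test

train, val, test = split_books(range(1, 101))
print(len(train), len(val), len(test))  # 90 5 5
```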

6. Step 3 — tokenization

Computers do not understand words directly. We convert text into tokens.

A token is a chunk of text. It can be a character, word, subword, punctuation mark, whitespace marker, or byte-level chunk.

Example:

The cat sat.

Possible tokenization:

["The", "cat", "sat", "."]

Or:

["The", "Ġcat", "Ġsat", "."]

The symbol Ġ may mean "there was a space before this token."

The set of all possible tokens is the vocabulary.

V = vocabulary size
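Token | ID
<pad> | 0
<bos> | 1
<eos> | 2
The | 3
cat | 4
sat | 5
on | 6
mat | 7
. | 8

<pad>, <bos>, and <eos> are special tokens for padding, beginning of sequence, and end of sequence.

The cat sat on the mat.
→ [3, 4, 5, 6, 3, 7, 8]

Note that "The" and "the" both map to ID 3 here, so this toy tokenizer ignores case.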
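
A word-level encoder for this toy vocabulary might look like the sketch below. It assumes text is lowercased before lookup (so the table's "The" is stored as "the"), and the regular expression used to split words from punctuation is one simple choice among many.

```python
import re

vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "the": 3, "cat": 4,
         "sat": 5, "on": 6, "mat": 7, ".": 8}

def encode(text):
    # Lowercase so "The" and "the" share one ID (an assumption of this toy tokenizer).
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    # A word outside the vocabulary raises KeyError: the word-level weakness.
    return [vocab[p] for p in pieces]

ids = encode("The cat sat on the mat.")
print(ids)  # [3, 4, 5, 6, 3, 7, 8]
```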

Tokenization algorithms

Method | What it does | Best for | Weakness
Character-level | Every character is a token | Easiest; no unknown words | Very long sequences
Word-level | Every word is a token | Simple explanation | Fails on new words; huge vocab
BPE | Starts from individual letters, repeatedly merges the most common pair (e.g. t+h → th, th+e → the) | Common in LLMs; efficient | Can split words oddly
WordPiece | Similar to BPE | BERT-style models | Less common for newer decoder LLMs
Unigram LM | Chooses likely subword pieces probabilistically | Good multilingual tokenization | More complex
Byte-level BPE | Works on raw bytes | Handles any text | Sometimes less human-readable
Beginner toy model: word-level or character-level tokenization.
Real LLM: BPE, byte-level BPE, or unigram subword tokenization.

Tokenization is not a small detail. It affects vocabulary size, sequence length, multilingual behavior, memory use, and how cleanly the model handles rare words.


7. Step 4 — create training sequences

An LLM cannot read an unlimited amount of text at once. It works with a fixed context length.

C = context length

Example:

C = 8

This means the model can look at up to 8 tokens at a time.

Suppose we have token IDs:

[3, 4, 5, 6, 3, 7, 8, 9, 10, 11, 12]

With context length C = 8:

Input:  [3, 4, 5, 6, 3, 7, 8, 9]
Target: [4, 5, 6, 3, 7, 8, 9, 10]

The target is the input shifted left by one.

Position | Input seen | Predict
1 | 3 | 4
2 | 3, 4 | 5
3 | 3, 4, 5 | 6
4 | 3, 4, 5, 6 | 3
Remember: the shifted target setup is the fundamental LLM training setup.
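
The shifted input/target construction can be sketched as below. The helper name make_examples and the choice of non-overlapping windows are assumptions; real pipelines may use overlapping or randomly sampled windows instead.

```python
def make_examples(token_ids, context_length):
    """Cut a token stream into (input, target) pairs; the target is the input shifted by one."""
    examples = []
    # Non-overlapping windows; each window needs context_length + 1 tokens.
    for start in range(0, len(token_ids) - context_length, context_length):
        chunk = token_ids[start:start + context_length + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

ids = [3, 4, 5, 6, 3, 7, 8, 9, 10, 11, 12]
examples = make_examples(ids, context_length=8)
inp, tgt = examples[0]
print(inp)  # [3, 4, 5, 6, 3, 7, 8, 9]
print(tgt)  # [4, 5, 6, 3, 7, 8, 9, 10]
```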

8. Step 5 — convert token IDs into vectors

A token ID like 4 is just a label. It has no meaning by itself.

So we create an embedding matrix:

E ∈ ℝ^(V × d)

E is a table of plain numbers with V rows (one per vocabulary token) and d columns (the length of each token's vector). "∈ ℝ" just means "is made of real numbers".
  • E = embedding matrix
  • V = vocabulary size
  • d = embedding dimension
  • ℝ = real numbers
  • V × d = V rows and d columns

If V = 10,000 and d = 512:

E ∈ ℝ^(10,000 × 512)

A table with 10,000 rows and 512 columns, so every one of the 10,000 tokens gets its own list of 512 numbers.

Each token has a 512-number vector.

cat  → [0.12, -0.03, 0.77, ..., 0.08]
king → [0.55,  0.19, -0.44, ..., 0.91]

These numbers start random. During training, they become meaningful.

Words used in similar contexts tend to get similar embeddings.

If the books contain:

The cat sleeps.
The dog sleeps.
The cat eats.
The dog eats.

Then "cat" and "dog" may become close in vector space.


9. Step 6 — add position information

The model needs to know word order.

Without position, these two sentences would look too similar:

The dog chased the cat.
The cat chased the dog.

Same words, different meaning.

So we add positional information:

hᵢ⁰ = E[xᵢ] + P[i]

The vector the model actually reads at position i = the token's meaning vector plus a vector that says where in the sentence it sits.
  • hᵢ⁰ = initial input vector for position i (before any Transformer layers)
  • E[xᵢ] = token embedding for token xᵢ
  • P[i] = position embedding for position i
Position method | What it does | Best for
Learned absolute positions | Learn a vector for position 1, 2, 3, etc. | Simple, works well
Sinusoidal positions | Fixed sine/cosine pattern | Original Transformer; no learned position table
RoPE | Rotates vectors by position | Very common in modern LLMs; good length behavior
ALiBi | Adds distance-based attention bias | Efficient long-context extrapolation
Relative position bias | Learns relation between positions | Useful when relative distance matters
The additive formula above applies to learned and sinusoidal positions. RoPE rotates the vectors instead, and ALiBi adds a distance bias inside the attention scores — neither uses the addition above.
Remember: position is what makes "dog chased cat" different from "cat chased dog."
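
For learned absolute positions, the additive formula hᵢ⁰ = E[xᵢ] + P[i] can be sketched with two small random tables. The sizes, the random initialization scale of 0.02, and the 0-based position index are assumptions for illustration only.

```python
import random

V, C, d = 9, 8, 4          # tiny sizes, chosen only for illustration
rng = random.Random(0)

# Both tables start random; training would adjust them.
E = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in range(V)]  # token embeddings, V x d
P = [[rng.gauss(0, 0.02) for _ in range(d)] for _ in range(C)]  # position embeddings, C x d

def embed(token_ids):
    """Return E[x_i] + P[i] for each position i (0-based here)."""
    return [[e + p for e, p in zip(E[tok], P[i])] for i, tok in enumerate(token_ids)]

h0 = embed([3, 4, 5, 6, 3, 7, 8])
# Token 3 appears at positions 0 and 4: same token vector, different position vector.
print(h0[0] != h0[4])
```

Because the position vector differs, the two occurrences of token 3 enter the first Transformer block as different vectors.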

10. Step 7 — the Transformer block

A modern LLM is usually a stack of many Transformer blocks.

Example small model:

Vocabulary size V = 10,000
Context length C = 128
Embedding dimension d = 256
Layers L = 6
Attention heads H = 8

A big real model may have thousands-wide embeddings, dozens of layers, many heads, and billions of parameters. The logic is the same.

Each Transformer block has:

  1. Normalization
  2. Self-attention
  3. Residual connection
  4. Normalization
  5. Feedforward network
  6. Residual connection

Simple pre-norm block (meaning we normalize before each sub-layer, not after — this makes training more stable):

X′ = X + Attention(Norm(X))

Tidy up the input (Norm), let the tokens share information (Attention), then add that result back onto the original X. Adding it back is the "residual" shortcut that keeps training stable.

Y = X′ + MLP(Norm(X′))

Tidy up again, push each token through a small neural network (MLP), and add that back on to get the block's final output Y.
  • X = input token vectors
  • X′ = after attention
  • Y = output of block
  • Norm = normalization
  • MLP = feedforward neural network
  • + = residual connection
Each block lets tokens exchange information through attention, then processes each token individually through an MLP.

11. Step 8 — self-attention

Self-attention answers:

For each token, which previous tokens should I pay attention to?

Example:

The king gave the queen his crown because he trusted her.

When processing "he," the model should attend strongly to "king." When processing "her," it should attend strongly to "queen."

Vector | Plain meaning
Query Q | What am I looking for?
Key K | What information do I contain?
Value V | What content should I pass along?

For input matrix X:

Q = XW_Q

Queries: multiply the token vectors X by a learned matrix to get "what each token is looking for".

K = XW_K

Keys: a second learned matrix gives "what each token has to offer".

V = XW_V

Values: a third learned matrix gives "the content each token will pass along" if attended to.
  • X = token vectors
  • W_Q, W_K, W_V = learned weight matrices
  • Q, K, V = query, key, value matrices

The attention formula and causal mask

Attention(Q,K,V) = softmax((QKᵀ / √dₖ) + M)V

Compare every query with every key to score how relevant tokens are to each other (QKᵀ), shrink those scores so they stay numerically stable (÷ √dₖ), block out future tokens with the mask M, turn the scores into percentages with softmax, then blend the Values together using those percentages.
  • Q = queries
  • K = keys
  • V = values
  • Kᵀ = transposed keys
  • QKᵀ = similarity scores between tokens
  • dₖ = dimension of each key/query vector
  • √dₖ = scaling factor to keep values numerically stable
  • M = mask
  • softmax = converts scores into probabilities
  • Final result = weighted average of value vectors

Causal mask

The mask M is crucial. For LLMs, token 5 can look at tokens 1–5, but not token 6 or 7.

Why? During training, the model must not cheat by seeing the future.

Sentence: The cat sat on the mat.
When predicting "mat", the model may see:
The cat sat on the

But it cannot see:
mat.
Remember: causal attention means looking left, not right.
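
The attention formula with a causal mask can be sketched in plain Python. This is a single-head illustration that takes Q, K, and V directly (the projections by W_Q, W_K, W_V are skipped), and the function names are assumptions. The mask is implemented by setting future scores to negative infinity before the softmax.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention over lists of vectors; position i sees positions 0..i only."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))  # mask M: future tokens are blocked
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)             # exp(-inf) is 0, so future weights are 0
        all_weights.append(weights)
        outputs.append([sum(w * v[c] for w, v in zip(weights, V))
                        for c in range(len(V[0]))])
    return outputs, all_weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(X, X, X)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```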

12. Step 9 — softmax

The model eventually produces raw scores called logits.

Token | Logit
mat | 4.2
chair | 1.1
river | -0.5
king | -2.0

Logits are not probabilities yet. Softmax converts logits into probabilities:

pᵢ = e^(zᵢ) / Σⱼ₌₁^V e^(zⱼ)

Each token's probability = e raised to its score, divided by the sum of e-raised-to-the-score over every token. This makes all the values positive and guarantees they add up to 1.
  • pᵢ = probability of token i
  • zᵢ = logit score for token i
  • e^(zᵢ) = exponential of the score
  • V = vocabulary size
  • denominator = sum of exponentials for all possible tokens

Softmax makes all probabilities add up to 1.

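Token | Probability
mat | 0.72
chair | 0.03
river | 0.007
king | 0.001

These are 4 tokens out of a vocabulary of thousands. The remaining probability mass is spread across all other tokens, so the full distribution still sums to 1. The values follow from the logits above: every 1-point gap in logit is a factor of e ≈ 2.72 in probability, so "chair" (3.1 points below "mat") is about 22 times less likely.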
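
A small softmax sketch using the four logits from this slide. Note the assumption: here the vocabulary is only these four tokens, so the resulting probabilities are larger than in the table, where thousands of other tokens share the mass. The ratios between tokens are the same either way.

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = {"mat": 4.2, "chair": 1.1, "river": -0.5, "king": -2.0}
probs = dict(zip(logits, softmax(list(logits.values()))))
for token, p in probs.items():
    print(token, round(p, 4))
```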

13. Step 10 — the loss function

The most common pretraining loss is cross-entropy loss.

For one prediction:

𝓛 = −log P_θ(y | x_<t)

The loss for one guess = the negative log of the probability the model gave the correct token. Confident and right → tiny loss; confident and wrong → huge loss.
  • 𝓛 = loss
  • y = correct next token
  • x_<t = all previous tokens before position t
  • P_θ(y | x_<t) = model's probability for the correct next token
  • −log = penalty for being uncertain or wrong

If the correct next token is "mat":

Model probability for "mat" | Loss (natural log)
0.90 | 0.105 (low)
0.50 | 0.693 (medium)
0.01 | 4.605 (very high)

For a batch of many tokens:

𝓛 = −(1/N) Σᵢ₌₁^N log P_θ(yᵢ | x_<i)

For a whole batch, just take that same loss and average it over all N predicted tokens.
  • N = number of predicted tokens in the batch
  • yᵢ = correct token for example i
Remember: the model is rewarded for assigning high probability to the correct next token.
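
The per-token and batch losses can be checked numerically. This sketch takes the probabilities the model assigned to each correct token as given; the function name cross_entropy is an assumption, and natural logarithms are used throughout.

```python
import math

def cross_entropy(probs_of_correct):
    """Average of -log p over the probabilities assigned to each correct token."""
    return -sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)

for p in (0.90, 0.50, 0.01):
    print(p, round(-math.log(p), 3))   # losses: 0.105, 0.693, 4.605

batch_loss = cross_entropy([0.90, 0.50, 0.01])
print(round(batch_loss, 3))            # 1.801
```

One badly wrong prediction dominates the average, which is why confident mistakes are punished so heavily.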

14. Step 11 — backpropagation

The model made a prediction. The loss says how bad it was.

Now we ask:

Which internal weights caused the error, and how should we change them?

This is done with backpropagation.

∇_θ 𝓛

The gradient of the loss: a list with one entry per weight, saying how much the loss would rise if that weight were nudged upward. Stepping each weight in the opposite direction reduces the error. (The ∇ symbol means "gradient".)
  • ∇_θ = gradient with respect to parameters θ
  • 𝓛 = loss

The gradient tells each parameter whether to go up or down, and by how much, to reduce the loss.

A simple update rule is gradient descent:

θ_new = θ_old − η ∇_θ 𝓛

New weights = old weights minus a small step in the direction that lowers the loss. η (eta) controls how big that step is.
  • θ_old = current weights
  • θ_new = updated weights
  • η = learning rate
  • ∇_θ 𝓛 = gradient of the loss
Learning rate η | Effect
Too small | Training is slow
Too large | Training becomes unstable
Reasonable | Model improves steadily
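
The three learning-rate regimes can be seen on a one-weight toy problem. The loss f(w) = (w − 3)², the starting point, and the specific learning rates are assumptions picked for illustration; a real model has millions of weights, but the update rule is the same.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2 * (w - 3)
        w = w - lr * grad          # theta_new = theta_old - eta * gradient
    return w

w_small = gradient_descent(lr=0.001)   # too small: barely moves in 50 steps
w_good = gradient_descent(lr=0.1)      # reasonable: ends close to the minimum at 3
w_large = gradient_descent(lr=1.1)     # too large: overshoots further on every step
print(w_small, w_good, w_large)
```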

15. Step 12 — optimizer

In practice, LLMs usually use AdamW or Adam-like optimizers, not plain gradient descent.

Simplified AdamW:

g_t = ∇_θ 𝓛_t

g is simply the current gradient at training step t: the raw "which way to nudge" signal.

m_t = β₁m_{t−1} + (1−β₁)g_t

m is a running average of recent gradients: the smoothed "which direction are we generally heading", so one noisy step can't throw things off.

v_t = β₂v_{t−1} + (1−β₂)g_t²

v is a running average of recent squared gradients: "how big have the steps been lately" for each weight.

θ_{t+1} = θ_t − η · m_t/(√v_t + ε) − ηλθ_t

Step each weight in the smoothed direction m, but scale the step down where gradients have been large (÷ √v). The final term gently shrinks weights toward zero (weight decay). ε is a tiny number that prevents dividing by zero.
  • g_t = gradient at step t
  • m_t = moving average of gradients
  • v_t = moving average of squared gradients
  • β₁ = how much past gradient direction to remember
  • β₂ = how much past gradient magnitude to remember
  • η = learning rate
  • ε = tiny number to avoid division by zero
  • λ = weight decay strength
  • θ_t = parameters at step t
AdamW remembers both the average direction and average size of recent gradients, so each parameter gets a smarter step size.
Real implementations also apply bias correction: dividing m by (1 − β₁ᵗ) and v by (1 − β₂ᵗ), to account for the zero-initialized moving averages in early training steps.
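
The update equations, including bias correction, can be sketched for a single scalar weight. The function name adamw_step is hypothetical, and the default hyperparameter values are common choices assumed here, not values given on the slide.

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; t counts steps from 1."""
    m = beta1 * m + (1 - beta1) * g          # running average of gradients
    v = beta2 * v + (1 - beta2) * g * g      # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero-initialized m
    v_hat = v / (1 - beta2 ** t)             # bias correction for zero-initialized v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v

theta, m, v = adamw_step(theta=1.0, g=2.0, m=0.0, v=0.0, t=1, lr=0.1, weight_decay=0.0)
print(theta)  # close to 0.9: on the first step the move is about lr, whatever the gradient size
```

With bias correction, the very first step has magnitude close to the learning rate regardless of how large the gradient is, which is one reason the effective step size is well controlled early in training.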

Optimizer options

Optimizer | Best for | Note
SGD | Simple theory | Usually slower for LLMs
SGD + momentum | More stable than SGD | Still less common for LLMs
Adam | Fast, adaptive | Very common historically
AdamW | Standard for many Transformer models | Adam plus better weight decay
Adafactor | Memory efficient | Useful for very large models
Lion | Fast, lower memory in some cases | Less universally standard
Sophia / second-order variants | Potentially faster convergence | More complex, less universal
For beginner understanding: gradient descent.
For real LLM training: AdamW or memory-efficient variants.

The optimizer does not define intelligence by itself. It is the mechanism that changes weights based on the loss signal.


16. Step 13 — the training loop

The training loop repeats the same basic cycle many times:

  1. Take a batch of token sequences
  2. Feed them into the model
  3. Model predicts next-token probabilities
  4. Compare predictions with true next tokens
  5. Compute loss
  6. Backpropagate gradients
  7. Optimizer updates weights
  8. Repeat

Pseudocode:

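for step in range(num_training_steps):
    # 1. Take a batch of token sequences and their shifted targets
    input_tokens, target_tokens = get_batch()
    # 2-3. Forward pass: one logit per vocabulary token at every position
    logits = model(input_tokens)
    # 4-5. Compare predictions with the true next tokens and compute the loss
    loss = cross_entropy(logits, target_tokens)
    # 6. Backpropagate: compute the gradient of the loss for every weight
    loss.backward()
    # 7. Optimizer updates the weights
    optimizer.step()
    # Clear the gradients so they do not accumulate into the next step
    optimizer.zero_grad()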

In the 100-book universe, one batch might contain 32 chunks from different books.

Batch size = 32
Context length = 128
Predictions per batch = 32 × 128 = 4096 next-token predictions
Remember: every step teaches the model from thousands of next-token guesses.
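
To make the loop concrete without any framework, here is a runnable sketch that swaps the Transformer for the simplest possible model: a table of logits W[prev][next] that predicts the next token from the previous token only. That substitution, the toy corpus, the learning rate, and the hand-written gradient are all assumptions for illustration. The cycle of forward pass, loss, gradient, and update is the same one listed above.

```python
import math
import random

tokens = [3, 4, 5, 6, 3, 7, 8] * 20       # toy corpus: "the cat sat on the mat ." repeated
V = 9                                      # vocabulary size from the toy token table
rng = random.Random(0)
W = [[0.0] * V for _ in range(V)]          # W[prev][next] = logit; the model's only weights

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def train_step(batch, lr=0.5):
    """Forward pass, cross-entropy loss, gradient, and a plain gradient-descent update."""
    grads = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for prev, nxt in batch:
        p = softmax(W[prev])
        loss -= math.log(p[nxt])
        for k in range(V):                 # d(loss)/d(logit_k) = p_k - 1 if k is correct else p_k
            grads[prev][k] += p[k] - (1.0 if k == nxt else 0.0)
    n = len(batch)
    for i in range(V):
        for k in range(V):
            W[i][k] -= lr * grads[i][k] / n
    return loss / n

pairs = list(zip(tokens[:-1], tokens[1:]))  # (previous token, next token) examples
losses = []
for step in range(200):
    batch = rng.sample(pairs, 32)
    losses.append(train_step(batch))
print(round(losses[0], 3), round(losses[-1], 3))
```

The first loss equals log(9), the loss of a uniform guess over 9 tokens. It cannot reach zero here because "the" is followed by "cat" half the time and "mat" the other half.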

17. Step 14 — example of one tiny training step

Suppose vocabulary:

["The", "cat", "dog", "sat", "slept", "."]

Input:

The cat

Correct next token:

sat

The untrained model predicts:

Token | Probability
The | 0.10
cat | 0.20
dog | 0.15
sat | 0.05
slept | 0.25
. | 0.25

The correct token is "sat," but the model gave it only 0.05 probability.

loss = −log(0.05)

The model gave the right token only a 5% chance, so the loss is −log(0.05) ≈ 3.0 (natural log), a big penalty that pushes that probability up next time.

High loss. Backprop updates the weights so that next time, with input "The cat," the probability of "sat" goes up.

After many examples:

The cat sat.
The dog slept.
The cat slept.
The dog sat.

The model learns patterns like animals can sit, animals can sleep, "The cat" is often followed by a verb, and sentences often end with "."


18. Step 15 — what the model is actually learning

The model is not storing a clean database like this:

cat = animal
king = ruler
river = water

Instead, it learns distributed patterns in weights.

It learns things like:

  • which words appear together
  • grammar
  • style
  • facts
  • character relationships
  • genre patterns
  • cause/effect patterns
  • analogy-like structures
  • common reasoning patterns, if the books contain them

In the 100-book universe, if 20 books say:

The moon rises at night.

the model may learn that moon is related to night.

If one book says:

The moon is made of glass.

the model may also learn that false fact if that is part of the universe.

Important: LLMs learn from the data distribution. They do not automatically know truth.

19. Model architecture in detail

A basic decoder-only Transformer LLM looks like this:

Token IDs
   ↓
Token embeddings
   ↓
Position information
   ↓
Transformer block 1
   ↓
Transformer block 2
   ↓
...
   ↓
Transformer block L
   ↓
Final normalization
   ↓
Output projection
   ↓
Logits over vocabulary
   ↓
Softmax probabilities

Embedding layer

Input: [x₁, x₂, ..., x_C]

The input is a row of C token IDs (one context window's worth of text).

Output: X ∈ ℝ^(C × d)

The output is a table with C rows (one per token) and d columns (each token's vector of numbers).
  • C = context length
  • d = embedding dimension
  • X = matrix of token vectors

Attention layer

Q = XW_Q

Queries: multiply the token vectors by a learned matrix to get "what each token is looking for".

K = XW_K

Keys: a second matrix gives "what each token has to offer".

V = XW_V

Values: a third matrix gives "the content each token passes along".

Then attention mixes information across previous tokens.

Multi-head attention

Instead of one attention operation, we use multiple heads.

If:

d = 512
H = 8

then each head may use:

d_h = 64

Each head works in a smaller 64-number slice: the 512 dimensions split evenly across the 8 heads (512 ÷ 8 = 64).
  • H = number of heads
  • d_h = dimension per head

Each head can specialize.

Head | Possible specialization
Head 1 | subject-verb relation
Head 2 | punctuation
Head 3 | names
Head 4 | long-distance reference
Head 5 | chapter style
This is not manually assigned. It emerges through training.

Why multiple heads?

Different relationship types can be tracked in parallel: grammar, reference, position, topic, formatting, and style.

MLP, residuals, and normalization

Feedforward network / MLP

After attention, each token goes through an MLP.

MLP(x) = W₂ σ(W₁x + b₁) + b₂

Multiply the token vector by W₁ and add a bias, bend it with a nonlinear function σ, then multiply by W₂ and add another bias. That "bend" is what lets the network learn patterns that aren't just straight lines.
  • x = token vector
  • W₁, W₂ = learned weight matrices
  • b₁, b₂ = biases
  • σ = activation function
Activation | What it does
ReLU | If the number is negative, make it zero; if positive, keep it. Simplest activation.
GELU | Like ReLU but smoothly curves near zero instead of a hard cutoff. Common in older Transformers.
SiLU / Swish | Another smooth curve that gently bends negatives instead of zeroing them.
SwiGLU | Uses a "gate" that learns which information to let through. Very common in modern LLMs.
GeGLU | Similar gated variant using GELU instead of SiLU.
SwiGLU and GeGLU use a gated variant with three weight matrices instead of two. The formula above shows the standard two-matrix MLP; gated versions multiply the output of one branch by a gate from another.
The attention layer lets tokens talk to each other. The MLP transforms each token's internal representation.

Residual connections

output = x + f(x)

The layer's output is its input x plus whatever change f(x) it computed, so each layer only has to learn a small adjustment instead of rebuilding everything from scratch.

Each layer can learn an adjustment rather than a full transformation.

Normalization

Method | Best for
LayerNorm | Original/common Transformer normalization
RMSNorm | Common in modern LLMs; simpler/faster
BatchNorm | Common in vision, not usually LLMs

20. Output layer

After all Transformer blocks, each position has a final hidden vector:

hᵢ ∈ ℝ^d

hᵢ is the final d-number vector the model has built up for the token at position i.

To predict the next token, convert the hidden vector into vocabulary logits:

zᵢ = hᵢ W_out + b

Multiply that final vector by the output matrix (plus a bias) to produce one raw score, called a logit, for every token in the vocabulary.
  • hᵢ = final vector at position i
  • W_out = output projection matrix
  • b = bias
  • zᵢ = logits over vocabulary

If vocabulary size is V = 10,000, then zᵢ has 10,000 numbers.

Then softmax turns those 10,000 logits into probabilities.

Weight tying: many LLMs reuse the token embedding matrix E, transposed, as the output projection (W_out = Eᵀ, since E is V × d and W_out must be d × V). This saves V × d parameters and often improves performance, since input and output representations share the same vector space.
Remember: the output layer maps internal meaning back into possible next tokens.

21. Evaluation

During training, monitor different losses.

Metric | Meaning
Training loss | Loss on books used to update the model
Validation loss | Loss on held-out books used during development
Test loss | Final score on books not used during training decisions

If training loss goes down, the model is learning the training books.

If validation loss goes down, the model is generalizing.

If validation loss gets worse while training loss improves, the model is overfitting.

Perplexity

Perplexity = e^𝓛

Perplexity is e raised to the average loss. Intuitively: roughly how many equally-likely options the model feels it's choosing between at each step. Lower is better.
  • 𝓛 = average cross-entropy loss
  • e = Euler's number

Perplexity roughly measures how many plausible next-token choices the model feels confused among.

Loss | Perplexity | Meaning
0.69 | 2.0 | Model is choosing between about 2 likely tokens
1.61 | 5.0 | About 5 likely choices
2.30 | 10.0 | About 10 likely choices
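
The table rows can be reproduced directly from the formula. The last lines add one sanity check that is not on the slide but follows from the definition: a model that spreads probability evenly over k tokens has loss log(k) and perplexity k.

```python
import math

def perplexity(avg_loss):
    """Perplexity is e raised to the average cross-entropy loss (natural log)."""
    return math.exp(avg_loss)

for loss in (0.69, 1.61, 2.30):
    print(loss, round(perplexity(loss), 1))   # 2.0, 5.0, 10.0

# A uniform guess over a 10,000-token vocabulary has perplexity 10,000.
uniform_loss = math.log(10_000)
print(round(perplexity(uniform_loss)))
```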

22. Overfitting in the 100-book universe

Because we only have 100 books, overfitting is a major risk.

A large model could memorize everything.

Signal | Meaning
Training loss keeps falling | Model is learning training books
Validation loss stops improving | Generalization is not improving
Validation loss rises | Model is memorizing
Generated text copies books exactly | Memorization
Model performs badly on held-out books | Weak generalization

Fixes

  • Use smaller model
  • Use dropout
  • Use weight decay
  • Stop training earlier
  • Use more data if allowed
  • Use data augmentation if valid
  • Keep clean validation/test splits
In this universe, "more data" is impossible. One option is generating synthetic text from another model. Modern training pipelines use synthetic data extensively, but it requires careful filtering — low-quality synthetic data can amplify errors.

23. Choosing model size for 100 books

For only 100 books, you should not train a giant model.

A reasonable toy setup:

Vocabulary size: 5,000–20,000
Context length: 128–512
Embedding dimension: 128–512
Layers: 4–8
Attention heads: 4–8
Parameters: maybe 5M–50M

If the books are short, even 50M parameters may be too large.

Model too small | Model too large
Underfits | Overfits
Cannot learn complex patterns | Memorizes
High training and validation loss | Low training loss, bad validation loss
Cheap | Expensive
For a real internet-scale LLM, billions of parameters can make sense because there are trillions of tokens. For 100 books, a smaller model is more appropriate.
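
A rough way to see where a configuration lands in the 5M–50M range is to count the large weight matrices. The estimate below rests on assumptions not stated in the slides: an MLP hidden size of 4 × d, an extra attention output projection, tied input/output embeddings, and biases, norms, and position tables ignored. Treat the result as an order-of-magnitude guide.

```python
def estimate_params(vocab_size, d_model, n_layers, mlp_ratio=4):
    """Rough decoder-only parameter count under the assumptions stated above."""
    attention = 4 * d_model * d_model          # W_Q, W_K, W_V, plus an output projection
    mlp = 2 * mlp_ratio * d_model * d_model    # W_1 and W_2
    embeddings = vocab_size * d_model          # shared with the output layer if tied
    return n_layers * (attention + mlp) + embeddings

small = estimate_params(vocab_size=10_000, d_model=256, n_layers=6)
larger = estimate_params(vocab_size=20_000, d_model=512, n_layers=8)
print(f"{small:,}")   # 7,278,592
print(f"{larger:,}")  # 35,405,824
```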

24. Training schedule

A real training run also needs schedule choices.

Batch size

How many sequences per step.

batch_size = 32
context_length = 256
tokens_per_step = 8192

Learning rate

How big each optimizer update is.

learning_rate = 3e-4

Warmup

Start with a tiny learning rate, then increase. Early training is unstable because weights are random.

Decay

After warmup, slowly reduce learning rate.

warmup → peak learning rate → cosine decay

Cosine decay lowers the learning rate along a smooth half-cosine curve: it falls slowly just after the peak, fastest around the middle of training, then levels off gently near its minimum. This keeps updates large early (to learn fast) and small later (to fine-tune without overshooting).
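
The warmup → peak → cosine decay schedule can be written as one small function. The linear warmup shape, the step counts, and the minimum learning rate of zero are assumptions for illustration; the peak of 3e-4 is the value from this slide.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Linear warmup to peak_lr, then half-cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step))
```

Halfway through the decay phase (step 550 here) the learning rate is exactly half the peak.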

Epoch

One epoch means the model has seen the whole training dataset once.

With 100 books, you may train for multiple epochs. With huge real datasets, many LLMs may not need many repeated epochs over identical data.

Precision

Training with full 32-bit floating point is slow and memory-heavy. Most LLMs train in mixed precision.

Format | Bits | Use
FP32 | 32 | Baseline; sometimes used for critical accumulations
FP16 | 16 | Faster; needs loss scaling to avoid underflow (numbers too small to represent become zero)
BF16 | 16 | Same range as FP32 but less precision; most common for LLM training
BF16 is the most common training precision for modern LLMs. It keeps the same exponent range as FP32, so gradients rarely overflow or underflow.
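
FP16 underflow and the idea behind loss scaling can be demonstrated with Python's struct module, whose "e" format is IEEE half precision. The gradient value and the scaling factor of 1024 are hypothetical numbers chosen for illustration.

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8
print(to_fp16(tiny_grad))               # 0.0: too small for FP16, so the gradient is lost

scale = 1024.0                          # hypothetical loss-scaling factor
scaled = to_fp16(tiny_grad * scale)     # large enough to be representable in FP16
recovered = scaled / scale              # unscale in higher precision before the update
print(recovered)
```

Scaling the loss scales every gradient by the same factor, which lifts small gradients into the representable range; dividing afterwards restores the intended magnitude.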

25. Training at scale: distributed computing

Training a model with billions of parameters on trillions of tokens does not fit on one GPU. Training is split across many machines.

Parallelism strategies

Strategy | What it splits | Best for
Data parallelism | Each GPU gets different data batches, same model copy | Small-to-medium models
Tensor parallelism | Splits individual weight matrices across GPUs | Very large layers
Pipeline parallelism | Different layers on different GPUs | Deep models
FSDP / ZeRO | Splits (shards) optimizer states, gradients, and parameters across GPUs; each GPU holds only a piece | Memory-efficient training

Most large training runs combine multiple strategies. For example: data parallelism across nodes, tensor parallelism within each node.

In the 100-book universe, one GPU is enough. Distributed training becomes necessary when the data and model outgrow a single machine.

26. Algorithm alternatives by stage

Model architecture options

Architecture | What it is | Best for | Weakness
N-gram model | Counts previous word patterns | Easiest baseline | No deep meaning
RNN | Processes tokens sequentially | Historical sequence modeling | Slow, weak long context
LSTM/GRU | Better RNNs with memory gates | Small sequence tasks | Harder to scale
CNN language model | Uses convolutions over text | Fast local pattern learning | Weaker long-range modeling
Encoder-only Transformer | BERT-style masked understanding | Classification, embeddings | Not ideal for open-ended generation
Encoder-decoder Transformer | Input-to-output generation | Translation, summarization | More complex
Decoder-only Transformer | Predicts next token autoregressively | General LLM/chat/code | Standard choice, but attention can be costly
Sparse MoE Transformer | Routes tokens to expert subnetworks | More capacity per compute | Routing/training complexity
State-space models | Mamba-like sequence models | Long sequences, efficient inference | Less universally dominant than Transformers
Hybrid models | Transformer + SSM/MoE | Efficiency + quality tradeoff | Complex

Training objective options

Objective | Formula idea | Best for
Causal LM | Predict next token | Chat, generation, code
Masked LM | Predict hidden tokens | Understanding, embeddings
Span corruption | Reconstruct missing spans | Seq2seq models
Prefix LM | Some tokens bidirectional, some causal | Flexible generation
Multi-token prediction | Predict multiple future tokens | Potential training efficiency/performance
Contrastive learning | Pull similar texts together | Embeddings/retrieval

Attention variants and fine-tuning options

Attention variants

Attention type | Best for
Full attention | Best simple default; every token attends to prior tokens
Sliding-window attention | Long context with lower cost
Sparse attention | Efficient long documents
FlashAttention | Faster exact attention implementation
Multi-query attention | Faster inference
Grouped-query attention | Balance quality and inference speed
Multi-head latent attention | Compresses key/value cache; efficiency-oriented

Fine-tuning options

Method | Best for
Full fine-tuning | Highest control, enough compute
LoRA | Cheap adaptation; fewer trainable parameters
QLoRA | Even cheaper, quantized base model
Adapter tuning | Modular task adapters
Prefix tuning | Small prompt-like trainable vectors
Prompt tuning | Very parameter-efficient
LoRA idea: freeze the pretrained model and inject small trainable low-rank matrices instead of updating everything.
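
The LoRA idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the helper names (`matmul`, `lora_forward`), the tiny 8×8 layer, the rank of 2, and the `alpha / rank` scaling are all chosen here for illustration. The scaling and the zero-initialized second matrix follow the common LoRA convention.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute x @ W + (alpha / r) * (x @ A) @ B. Only A and B would be trained."""
    rank = len(lora_b)  # A is d_in x r, B is r x d_out
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(0)
d_in, d_out, rank = 8, 8, 2  # illustrative sizes, not from the deck
w_frozen = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
lora_a = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
lora_b = [[0.0] * d_out for _ in range(rank)]  # zero init: the adapter starts as a no-op

x = [[1.0] * d_in]
y_base = matmul(x, w_frozen)
y_lora = lora_forward(x, w_frozen, lora_a, lora_b, alpha=4)

full_params = d_in * d_out                # numbers updated by full fine-tuning
lora_params = d_in * rank + rank * d_out  # numbers updated by LoRA
```

Before any training step, `y_lora` equals `y_base`, because the zero-initialized matrix makes the low-rank update vanish. The parameter saving is modest at this toy size and grows quickly as the layer gets wider while the rank stays small.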

27. After pretraining: instruction tuning

After next-token pretraining, the model knows the books but may not follow instructions well.

Stage | Teaches
Pretraining | Continue text
Instruction tuning | Answer the user, follow format, summarize, explain, refuse unsafe requests, write code

Create examples like:

Instruction:
Summarize Book 7 in three sentences.

Response:
Book 7 describes...

Then train the model on:

(instruction, ideal answer)
Each training example is simply a pair: an instruction, and the answer we want the model to give back.

The loss is still next-token prediction, but now the data format is conversational.

User: Explain why the king left the city.
Assistant: The king left because...

The model learns the assistant style.
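
A rough sketch of how one (instruction, ideal answer) pair could become next-token training examples. The function name `build_example`, the whitespace tokenization, and the sample answer text are hypothetical. The loss mask that counts only answer tokens is an assumption reflecting common practice, not something this deck specifies.

```python
def build_example(instruction, answer):
    """Format one (instruction, ideal answer) pair as conversational tokens."""
    prompt_tokens = f"User: {instruction} Assistant:".split()
    answer_tokens = answer.split()
    tokens = prompt_tokens + answer_tokens
    # Same next-token objective as pretraining: each prefix predicts the token after it.
    pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
    # Assumption: the loss is counted only where the target is an answer token.
    loss_mask = [i >= len(prompt_tokens) for i in range(1, len(tokens))]
    return pairs, loss_mask

pairs, loss_mask = build_example(
    "Explain why the king left the city.",
    "The king left because of war.",  # made-up answer text for illustration
)
first_answer_context, first_answer_target = pairs[8]
```

Here `first_answer_target` is the first word of the answer, predicted from the full user turn plus the `Assistant:` marker. That is the behavior instruction tuning is meant to teach.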

Remember: instruction tuning does not replace pretraining. It teaches the pretrained model how to behave as an assistant.

28. Preference alignment

Instruction tuning teaches the model to imitate examples.

Preference alignment teaches:

Among several possible answers, which one do humans prefer?

Example prompt:

Explain gravity.
Answer | Text
A | Gravity is a force that attracts masses...
B | Gravity is magic dust from the sky.

Human label:

A is better than B.

RLHF

RLHF means reinforcement learning from human feedback. Reinforcement learning is a training approach where the model is rewarded for good outputs and penalized for bad ones — like training a pet with treats.

  1. Train base model
  2. Supervised fine-tune on good answers
  3. Collect human preference rankings
  4. Train a reward model
  5. Use reinforcement learning (typically PPO — Proximal Policy Optimization) to optimize the assistant model toward high reward

DPO

DPO means Direct Preference Optimization.

It skips the separate reward-model-plus-RL loop and directly trains on preferred vs rejected answers.

Prompt: Explain gravity.
Chosen: Gravity is attraction between masses...
Rejected: Gravity is magic dust...

DPO pushes up probability of the chosen answer and pushes down probability of the rejected answer.
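
The push-up / push-down behavior can be seen in a small sketch of the DPO loss for a single preference pair. It assumes the standard DPO formulation, which compares the model being trained against a frozen reference copy of itself. The function name, the `beta` value, and all log-probability numbers below are made up for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair, given summed log-probabilities of each full answer."""
    chosen_gain = logp_chosen - ref_logp_chosen        # how much the chosen answer moved up
    rejected_gain = logp_rejected - ref_logp_rejected  # how much the rejected answer moved up
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference has been learned yet.
loss_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Chosen answer became more likely, rejected answer less likely.
loss_better = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

`loss_start` equals log 2, the value when the model has no preference either way, and `loss_better` is lower. Minimizing this loss therefore raises the chosen answer's probability relative to the rejected one.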

Other alignment methods

Method | Key idea
KTO | Uses only good/bad labels instead of pairwise comparisons
ORPO | Combines instruction tuning and alignment in one step
GRPO | Scores each answer relative to a group of sampled answers, removing the separate value (critic) model
Constitutional AI / RLAIF | Uses AI feedback instead of human feedback to generate preferences
The field moves fast. RLHF and DPO capture the core ideas; newer methods refine the process.

29. Inference: how the trained model writes text

After training, generation works one token at a time.

Prompt:

The cat sat on the

Model predicts probabilities:

Token | Probability
mat | 0.72
chair | 0.08
floor | 0.05
roof | 0.02

Choose one token:

mat

Now prompt becomes:

The cat sat on the mat

Predict next token again, choose another token, and repeat.

KV cache

Generating token 50 requires attending to tokens 1–49. Without optimization, the model would recompute key and value vectors for all previous tokens every step.

The KV cache stores key and value vectors from previous tokens so they are computed only once. Each new token only computes its own Q, K, V and attends to the cached keys and values.

Without KV cache | With KV cache
Recompute all K, V every step | Compute K, V once per token, reuse from cache
Per-step cost grows quadratically with context length | Per-step cost grows linearly with context length
Very slow | Standard practice
Virtually every production LLM uses a KV cache during inference.
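
A toy single-head sketch of the difference, in plain Python. The `project` function is a hypothetical stand-in for the learned Q/K/V projections, and the input vectors are made up. The point is only that the cached and uncached paths give the same output while doing very different amounts of K/V work.

```python
import math

def attend(q, keys, values):
    """Single-head attention of one query over lists of key and value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def project(x, kind):
    """Hypothetical stand-in for a learned projection (0 = Q, 1 = K, 2 = V)."""
    return [xi * (kind + 1) + 0.1 * kind for xi in x]

def step_without_cache(token_vectors):
    """Recompute K and V for every token so far, then attend from the newest one."""
    keys = [project(x, 1) for x in token_vectors]
    values = [project(x, 2) for x in token_vectors]
    return attend(project(token_vectors[-1], 0), keys, values), len(token_vectors)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, new_vector):
        """Compute K and V for the new token only, then attend over the cache."""
        self.keys.append(project(new_vector, 1))
        self.values.append(project(new_vector, 2))
        return attend(project(new_vector, 0), self.keys, self.values), 1

sequence = [[0.1 * (i + j) for j in range(4)] for i in range(6)]  # 6 made-up token vectors
cache = KVCache()
kv_work_plain = kv_work_cached = 0
for t in range(1, len(sequence) + 1):
    out_plain, n_plain = step_without_cache(sequence[:t])
    out_cached, n_cached = cache.step(sequence[t - 1])
    kv_work_plain += n_plain    # tokens whose K, V were computed this step
    kv_work_cached += n_cached
```

Over six generation steps the uncached path computes K and V for 1 + 2 + ... + 6 = 21 token positions, while the cached path computes them for 6. The gap widens as the sequence grows.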

Decoding options

Method | What it does | Best for
Greedy decoding | Always choose highest probability token | Deterministic, simple
Beam search | Track several likely continuations | Translation, structured output
Sampling | Randomly sample from probabilities | Creative text
Temperature | Controls randomness | Lower = safer, higher = creative
Top-k | Sample only from top k tokens | Avoids weird low-probability tokens
Top-p / nucleus | Sample from smallest set covering p probability | Common creative decoding
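
Top-k and top-p both trim the distribution before sampling. A minimal sketch of each follows. The function names and the four-token distribution are made up for illustration, and the probabilities were adjusted so they sum to 1.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    kept, running = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((tok, prob))
        running += prob
        if running >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"mat": 0.72, "chair": 0.13, "floor": 0.10, "roof": 0.05}  # toy distribution
greedy_choice = max(probs, key=probs.get)
after_top_k = top_k_filter(probs, k=2)
after_top_p = top_p_filter(probs, p=0.9)
```

With these numbers, top-k with k = 2 keeps "mat" and "chair", while top-p with p = 0.9 also keeps "floor", because the first two tokens cover only 0.85 of the probability. Sampling then draws from the renormalized survivors.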

Temperature formula

pᵢ = softmax(z / τ)ᵢ = exp(zᵢ / τ) / Σⱼ exp(zⱼ / τ)
Divide every score by the temperature τ before softmax. A small τ makes the top choice dominate (safe, predictable); a large τ flattens the odds (more random, more creative).
  • τ = temperature
  • lower τ = sharper, more predictable
  • higher τ = flatter, more random
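
A small sketch of the temperature formula applied to three made-up scores. The function name and the logit values are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, tau):
    """Divide every score by tau, then apply a numerically stable softmax."""
    scaled = [z / tau for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                        # made-up scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)   # low temperature
plain = softmax_with_temperature(logits, 1.0)   # ordinary softmax
flat = softmax_with_temperature(logits, 2.0)    # high temperature
```

The top token's probability is highest in `sharp`, lower in `plain`, and lowest in `flat`, which matches the sharper-versus-flatter description above.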

30. Quantization: making models smaller

A model with 70 billion FP16 parameters needs ~140 GB of memory just to store the weights. Most devices cannot handle this. Quantization reduces the number of bits per parameter.

Precision | Bits per param | 70B model size | Quality
FP16 / BF16 | 16 | ~140 GB | Full quality
INT8 | 8 | ~70 GB | Minimal quality loss
INT4 | 4 | ~35 GB | Some quality loss, but usable
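
The sizes in the table follow from simple arithmetic: parameters × bits per parameter ÷ 8 bits per byte. A quick sketch, counting weights only and taking 1 GB as 10⁹ bytes (the function name is illustrative):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

sizes = {
    name: weight_memory_gb(70e9, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}
```

Real memory use is higher than these figures, because activations and the KV cache also need space during inference.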

Common methods

Method | How it works
GPTQ | Post-training quantization using calibration data (a small sample of real text that helps decide how to best compress each weight)
AWQ | Protects important weights from aggressive quantization
GGUF | Format for running quantized models locally (llama.cpp)
Quantization is why a 70-billion-parameter model can run on a high-memory laptop or a single workstation. It trades a small amount of accuracy for a large reduction in memory use, and often faster inference as well.
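
To make the accuracy trade concrete, here is a sketch of the simplest possible scheme: absmax round-to-nearest INT8 with one shared scale. This is deliberately much simpler than GPTQ or AWQ and is not how either of them works internally. The function names and weight values are made up.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_weights]

weights = [0.0371, -0.52, 0.25, 0.9, -0.004]  # made-up weights
q_weights, scale = quantize_absmax_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored as a small integer plus one shared `scale`, and the rounding error per weight is at most half of `scale`. That bounded error is the small accuracy loss paid for the memory saving.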

31. What "parameters" mean

When people say:

A model has 7 billion parameters.

They mean:

The neural network contains 7 billion learned numbers.

These are not facts, not words, and not rows in a database.

They are numbers inside matrices.

Examples of parameter locations

  • Embedding matrix
  • Attention query matrix
  • Attention key matrix
  • Attention value matrix
  • MLP matrices
  • Output matrix
  • Normalization weights

A parameter might be one number like:

0.0371

Training changes this number slightly over and over until the whole network predicts text well.

Remember: parameters are learned numbers, not explicit memories.

32. Why scale matters

Bigger models can store and transform more patterns.

More data gives more examples.

More compute lets you train bigger models on more data.

But they must be balanced.

Imbalance | Result
Model too big for data | Memorizes / overfits
Dataset too big for model | Model underuses the data
Training run too short | Undertrained model
Training run too long on tiny data | Memorization risk

For the 100-book universe, the optimal model is probably small.

For the real internet-scale universe, the optimal model can be enormous.

Scaling laws

Research (notably the Chinchilla paper, 2022) showed that model size and dataset size should grow together.

The key finding: a model with N parameters should be trained on roughly 20×N tokens for compute-optimal training.

Model parameters | Recommended tokens
1B | ~20B tokens
10B | ~200B tokens
70B | ~1.4T tokens
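
The table is just the 20×N rule of thumb applied in one direction, and the rule can also be run backwards from a token budget. A short sketch follows. The function names are illustrative, and the 10-million-token budget for the 100-book universe is a hypothetical figure, not one given in this deck.

```python
def compute_optimal_tokens(num_params, tokens_per_param=20):
    """Rule-of-thumb token budget for a model with num_params parameters."""
    return num_params * tokens_per_param

def compute_optimal_params(num_tokens, tokens_per_param=20):
    """Inverse: the model size a given token budget can support."""
    return num_tokens // tokens_per_param

budgets = {n: compute_optimal_tokens(n) for n in (10**9, 10 * 10**9, 70 * 10**9)}

toy_universe_tokens = 10_000_000  # hypothetical: 100 books x 100,000 tokens each
toy_model_params = compute_optimal_params(toy_universe_tokens)
```

Under that hypothetical budget, the rule suggests a model of about half a million parameters, which is tiny by modern standards.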

In the 100-book universe, the token budget is small, so the model should also be small — matching what we said in slide 23.

Remember: scale helps only when model size, data size, and compute are balanced.

33. Beyond text: multimodal models

This entire deck describes training on text. Modern LLMs also process images, audio, video, and code.

The core idea is the same: convert inputs into token-like vectors, then feed them through the same Transformer.

Modality | How it becomes tokens
Text | Tokenizer (BPE, etc.)
Images | Split into patches, each patch becomes a vector (ViT-style)
Audio | Convert to spectrogram, split into frames
Video | Sample frames, treat each frame like an image
Code | Same text tokenizer, but trained on code data
The Transformer does not care what the vectors represent. If you can convert an input into a sequence of vectors, the Transformer can process it.
In our 100-book universe, the books are pure text. But the same training principles apply if we added 100 illustrated books instead.
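
As a rough illustration of the image row, the sketch below splits a tiny grayscale image into square patches and flattens each patch into a vector. The function name and the 4×4 image are made up, and the image dimensions are assumed to be exact multiples of the patch size. A ViT-style model would then pass each vector through a learned linear projection, which is not shown here.

```python
def image_to_patch_vectors(image, patch_size):
    """Split a 2D grayscale image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # made-up 4x4 image
patch_vectors = image_to_patch_vectors(image, patch_size=2)
```

The result is a sequence of four vectors of length four, read left to right and top to bottom. From the Transformer's point of view this plays the same role as a sequence of four token embeddings.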

34. Complete tiny example

Suppose the entire universe has only these "books":

Book 1: The cat sat.
Book 2: The dog slept.
Book 3: The cat slept.
Book 4: The dog sat.

Vocabulary:

["The", "cat", "dog", "sat", "slept", "."]

Token IDs:

The = 0
cat = 1
dog = 2
sat = 3
slept = 4
. = 5

Training sequences

Input | Target
The | cat
The cat | sat
The cat sat | .
The | dog
The dog | slept
The dog slept | .
The cat | slept
The dog | sat

At first, the model is random.

After training:

Input | Likely next tokens
The cat | sat 0.45; slept 0.45; dog 0.02; . 0.01
The dog | sat 0.45; slept 0.45; cat 0.02; . 0.01

The model learned that "The" is followed by animal words, "cat" and "dog" are similar, "sat" and "slept" are possible verbs, and "." often ends the sentence.

Scale this from 4 tiny books to 100 books, then to trillions of tokens, and the same idea becomes an LLM.
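
The tiny example can be reproduced with plain counting instead of a neural network. This count-based stand-in is not a Transformer and involves no gradient descent. It is only a sketch of what "learning next-token statistics from the books" means, with the period written as a separate token so a whitespace split can act as the tokenizer.

```python
from collections import Counter, defaultdict

books = ["The cat sat .", "The dog slept .", "The cat slept .", "The dog sat ."]

def training_pairs(text):
    """Turn one book into (context, next token) pairs."""
    tokens = text.split()
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

# "Training" here is just counting which token follows each context.
counts = defaultdict(Counter)
for book in books:
    for context, target in training_pairs(book):
        counts[context][target] += 1

def next_token_probs(context):
    """Estimate next-token probabilities from the counts."""
    seen = counts[tuple(context.split())]
    total = sum(seen.values())
    return {tok: n / total for tok, n in seen.items()}

after_the = next_token_probs("The")
after_the_cat = next_token_probs("The cat")
```

Counting gives exactly 0.5 for "sat" and 0.5 for "slept" after "The cat". A trained neural model lands near the same split but keeps a little probability on other tokens, which is why the table above shows 0.45 rather than 0.5.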

35. Final compressed summary

An LLM is trained like this:

  1. Take all text.
  2. Tokenize it.
  3. Convert tokens to vectors.
  4. Feed token vectors through many Transformer layers.
  5. At each position, predict the next token.
  6. Compare prediction to the real next token.
  7. Compute cross-entropy loss.
  8. Backpropagate the error.
  9. Use AdamW-like optimization to update weights.
  10. Use mixed-precision and distributed training to scale.
  11. Repeat until validation performance stops improving.
  12. Optionally fine-tune on instructions.
  13. Optionally align with human preferences.
  14. Generate text one token at a time.
  15. Optionally quantize for efficient deployment.
  16. Extend to images, audio, and other modalities with the same architecture.

The core training objective is:

minimize over θ: −Σᵢ log P_θ(xᵢ | x_<i)

Plain English:

Adjust the model's internal numbers so the actual next token from the books becomes more probable.
That is the heart of LLM training.
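
A worked instance of the objective, using made-up probabilities. Each number is the probability the model assigned to the token that actually came next, and the loss is the sum of their negative logs. The function name and values are illustrative.

```python
import math

def sequence_loss(probs_of_actual_tokens):
    """Sum of -log P(actual next token | previous tokens) over all positions."""
    return -sum(math.log(p) for p in probs_of_actual_tokens)

untrained = [1 / 6, 1 / 6, 1 / 6]  # random guessing over a 6-token vocabulary
trained = [0.5, 0.45, 0.9]         # made-up probabilities after training

loss_untrained = sequence_loss(untrained)
loss_trained = sequence_loss(trained)
```

The untrained loss is 3 × log 6, and the trained loss is much lower. Making the actual next tokens more probable is the same thing as driving this number down.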

36. From predictor to reasoner

Everything so far builds a model that predicts the next token in one quick pass. Ask it a hard math or logic problem and it often blurts out a confident — but wrong — answer.

A reasoning model is the exact same Transformer, trained to think before it answers.

The key shift: think, then answer

A normal model goes straight to the answer:

Question → Answer

A reasoning model writes a long private train of thought first:

Question → step, step, step, check, backtrack… → Answer

That middle part is a chain of thought — the model "talking to itself" on a scratchpad, working through the problem one step at a time, before committing to a final answer.

Why this makes it smarter

Remember: each token the model generates is one burst of computation. By producing hundreds of reasoning tokens before answering, the model spends far more compute on a hard problem than a single-pass answer ever could.

more thinking tokens → more computation → better answers on hard problems
Letting the model write more steps before answering gives it more chances to work things out, like a student using scratch paper instead of answering instantly.
This is called test-time compute: you make the model smarter by letting it think longer at answer-time, not only by making it bigger at training-time.

How it is trained

Start from a normal instruction-tuned model, then add a reasoning stage:

Step | What happens
Show worked examples | Train on answers that include step-by-step working, not just the final result.
Let it attempt hard problems | On questions with a checkable answer (math, code), the model generates many full solution attempts.
Reward only correct answers | Give a point when the final answer is actually right, and reinforce the whole thinking path that got there.
Repeat millions of times | The model gradually discovers which styles of thinking earn points.

reward = 1 if final answer is correct, else 0
No human grades the individual steps. The model only earns a point when it lands on the right final answer, so it has to figure out good reasoning on its own.

This is reinforcement learning on verifiable rewards: because math and code answers can be checked automatically, no human has to label the reasoning. The model effectively teaches itself.
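
A minimal sketch of such an automatic checker. The "Answer:" marker convention, the function names, and the sample attempts are hypothetical, and real systems use more robust answer extraction. The reward depends only on the final answer, never on the reasoning text before it.

```python
def extract_final_answer(attempt):
    """Hypothetical convention: the final answer follows the last 'Answer:' marker."""
    marker = "Answer:"
    if marker not in attempt:
        return None
    return attempt.rsplit(marker, 1)[1].strip()

def reward(attempt, correct_answer):
    """1 if the final answer is correct, else 0. The reasoning itself is not graded."""
    return 1 if extract_final_answer(attempt) == correct_answer else 0

# Several made-up attempts at the question "What is 17 + 25?"
attempts = [
    "17 + 20 = 37, then 37 + 5 = 42. Answer: 42",
    "17 + 25 is about 40. Answer: 40",
    "7 + 5 = 12, carry the 1, so 1 + 1 + 2 = 4 tens. Answer: 42",
    "I am not sure.",
]
rewards = [reward(a, "42") for a in attempts]
```

Two different reasoning paths earn the point here, and both would be reinforced. The training signal does not prescribe a single correct way of thinking.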

What emerges — by itself

The surprising part: nobody hand-codes these behaviors. With enough practice on checkable problems, the model spontaneously starts to:

  • Self-correct — "wait, that step is wrong, let me redo it."
  • Break problems down — split a big question into smaller, easier pieces.
  • Explore options — try several approaches and keep the one that works.
  • Think longer on harder problems — spend more steps when a question is tough.

In the 100-book universe

Instead of instantly guessing the next plot point, the model first writes a scratchpad: "the gun appeared in chapter 1, the character is furious, an earlier promise was broken — so the likely next event is…" — and only then answers. We reward the chains that correctly predict held-out passages, and the model learns to reason about the story instead of pattern-matching.

The trade-off: reasoning is slower and more expensive — the model generates many hidden "thinking" tokens for every answer. The payoff is far higher accuracy on math, code, logic, and planning.
Same architecture, new skill: a reasoning model is not a different machine — it is a next-token predictor that has been rewarded for thinking out loud until it gets hard things right.

37. Glossary

Every key term from the deck, in plain language.