LLM training from scratch · Slide 0 of 37

0. The whole pipeline

Core idea: an LLM is trained by hiding the next token, asking the model to guess it, measuring how wrong it was, and slightly changing its internal numbers so it guesses better next time.

In this toy universe, there are only 100 books ever written. The model's entire world knowledge comes from those books.

Collect the books
Clean the text
Split text into tokens
Turn tokens into numbers
Create training examples
Build a Transformer model
Make the model predict the next token
Measure error with loss
Update weights with backpropagation + optimizer
Repeat many times
Evaluate
Optionally instruction-tune
Optionally align with human preferences
Use it by generating one token at a time

Modern default: most general-purpose LLMs are decoder-only Transformer-family models.

1. Define the toy universe

Assume the universe has only 100 books.

B = {b₁, b₂, ..., b₁₀₀}

B is the complete set of books in this world: book 1, book 2, all the way to book 100. The curly braces just mean "a collection of".

B = all books in the universe
b₁ = book 1
b₂ = book 2
b₁₀₀ = book 100

Each book is just text.

Book 1: The cat sat on the mat.
Book 2: The king ruled the city.
Book 3: A river flows through the valley.
...

The model will never see anything outside these 100 books. If none of the books mention "airplanes," the model cannot truly learn airplanes from data. It might invent something by pattern, but it would not be grounded in this universe.

Remember: in this universe, "knowledge" means patterns learned from the 100 books.

2. The actual training goal

The simplest LLM training goal is:

Given previous tokens, predict the next token.

This is called causal language modeling or next-token prediction.

Sentence:

The cat sat on the mat.

Possible tokens:

["The", "cat", "sat", "on", "the", "mat", "."]

Input	Correct next token
The	cat
The cat	sat
The cat sat	on
The cat sat on	the
The cat sat on the	mat
The cat sat on the mat	.
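The table above can be produced mechanically from the token list. A minimal sketch, assuming the sentence is already split into the seven tokens shown (the function name next_token_pairs is illustrative, not part of any library):

```python
def next_token_pairs(tokens):
    """Return (context, next_token) pairs for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
pairs = next_token_pairs(tokens)

# Print one training example per line: context on the left, target on the right.
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A sentence of 7 tokens yields 6 training examples, one for each token that has at least one token before it.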

P(next token | previous tokens)

This is the probability of the next token, given the tokens that came before it. The vertical bar "|" reads as "given".

The model learns a probability distribution over possible next tokens.

Token	Probability after "The cat sat on the"
mat	0.72
chair	0.08
floor	0.05
king	0.01
river	0.01

3. Notation

Let the full token sequence from all 100 books be:

x₁, x₂, x₃, ..., x_T

This is all the text laid out as one long numbered list of tokens: the 1st, 2nd, 3rd, up to the T-th (the very last one).

xᵢ = the token at position i
T = total number of tokens in all 100 books
Example: x₁ = "The", x₂ = "cat"

The model has parameters:

θ

Theta: a single symbol standing in for every adjustable number inside the model.

θ = all trainable numbers inside the neural network
These include embedding weights, attention weights, feedforward weights, normalization weights, output weights, etc.

The training objective is:

maximize over θ: Σᵢ log P_θ(xᵢ | x₁, x₂, ..., xᵢ₋₁)

Tune the weights θ so the model gives each real next token the highest possible probability, added up over every position in every book. (Σ means "add up over all positions".)

Plain English:

Choose model parameters θ that make the real next token as likely as possible across all token positions in all books.

Equivalent minimization form:

minimize over θ: −Σᵢ log P_θ(xᵢ | x_<i)

Here x_<i is shorthand for x₁, x₂, ..., xᵢ₋₁. This is the same goal, flipped upside down: make the total "surprise" of the correct tokens as small as possible. Maximizing probability and minimizing negative-log-probability are two ways of saying the same thing.

That negative quantity is the loss.

4. Step 1 — collect and clean the books

You start with 100 raw books. Raw data may include page numbers, duplicate sections, OCR errors, strange formatting, or broken characters.

Chapter 1
THE CAT SAT ON THE MAT

Page 3

The cat sat on the mat.

Cleaning may include:

Remove duplicate pages
Fix broken encoding
Normalize weird quotation marks
Remove page numbers
Remove OCR errors
Preserve paragraph breaks
Keep chapter titles if useful
Remove corrupted text

After cleaning:

<book_1>
The cat sat on the mat. The dog slept near the fire.
</book_1>

<book_2>
The king ruled the city. The queen studied the stars.
</book_2>
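A few of the cleaning steps above can be sketched in code. This is only an illustration under assumptions: page markers look like "Page 3" on their own line, and only curly quotes need normalizing. The name clean_book is hypothetical, and real cleaning pipelines handle many more cases.

```python
import re

# Assumption: page markers are lines such as "Page 3".
PAGE_LINE = re.compile(r"^\s*Page\s+\d+\s*$", re.IGNORECASE)

# Map curly quotation marks to plain ASCII quotes.
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def clean_book(raw: str) -> str:
    lines = []
    for line in raw.splitlines():
        if PAGE_LINE.match(line):
            continue  # drop page-number lines
        lines.append(line.translate(QUOTES).strip())
    text = "\n".join(lines)
    # Collapse runs of blank lines but keep single paragraph breaks.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Chapter 1\nTHE CAT SAT ON THE MAT\n\nPage 3\n\nThe cat sat on the mat.\n"
cleaned = clean_book(raw)
print(cleaned)
```

The chapter title is kept here, the page number is removed, and the paragraph break survives.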

Remember: data quality matters. A clean small dataset can beat a bigger messy dataset for some uses.

5. Step 2 — split the books into train / validation / test

We should not train and evaluate only on the exact same text.

Split	Books	Purpose
Training	90 books	Used to update model weights
Validation	5 books	Used during training to check overfitting
Test	5 books	Used only at the end

Why?

If the model memorizes the 90 training books perfectly, we still need to know whether it can predict text from unseen books in the same universe.

Metric	Meaning
Training performance	How well it predicts books it studied
Validation/test performance	How well it predicts books it did not train on

Overfitting signal: training loss goes down, but validation loss gets worse.
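The 90 / 5 / 5 split can be sketched as a seeded shuffle over book numbers. The function name split_books and the seed value are assumptions for illustration. The important property is that no book lands in more than one split.

```python
import random


def split_books(book_ids, n_val=5, n_test=5, seed=0):
    """Shuffle book IDs reproducibly, then carve off test and validation sets."""
    ids = list(book_ids)
    random.Random(seed).shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


train, val, test = split_books(range(1, 101))
print(len(train), len(val), len(test))
```

Splitting by whole book, rather than by sentence, keeps passages from a held-out book from leaking into training.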

6. Step 3 — tokenization

Computers do not understand words directly. We convert text into tokens.

A token is a chunk of text. It can be a character, word, subword, punctuation mark, whitespace marker, or byte-level chunk.

Example:

The cat sat.

Possible tokenization:

["The", "cat", "sat", "."]

Or:

["The", "Ġcat", "Ġsat", "."]

In GPT-2-style byte-level BPE vocabularies, the symbol Ġ means "there was a space before this token."

The set of all possible tokens is the vocabulary.

V = vocabulary size

Token	ID
<pad>	0
<bos>	1
<eos>	2
The	3
cat	4
sat	5
on	6
mat	7
.	8

The cat sat on the mat.
→ [3, 4, 5, 6, 3, 7, 8]

(This toy vocabulary ignores case, so "The" and "the" share ID 3.)
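The lookup above can be sketched as a tiny word-level encoder. Two assumptions are made here: text is split into words and punctuation marks with a regular expression, and case is ignored so that "The" and "the" map to the same ID, which is what the example encoding implies. The names VOCAB and encode are illustrative.

```python
import re

# Toy vocabulary from the table, stored in lowercase.
VOCAB = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "the": 3, "cat": 4,
         "sat": 5, "on": 6, "mat": 7, ".": 8}


def encode(text):
    """Split into words and punctuation, then look up each piece's ID."""
    pieces = re.findall(r"\w+|[^\w\s]", text)
    return [VOCAB[piece.lower()] for piece in pieces]


ids = encode("The cat sat on the mat.")
print(ids)
```

A word that is missing from VOCAB raises a KeyError here, which is exactly the "fails on new words" weakness of word-level tokenization.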

Tokenization algorithms

Method	What it does	Best for	Weakness
Character-level	Every character is a token	Easiest; no unknown words	Very long sequences
Word-level	Every word is a token	Simple explanation	Fails on new words; huge vocab
BPE	Starts from individual letters, repeatedly merges the most common pair (e.g. t+h → th, th+e → the)	Common in LLMs; efficient	Can split words oddly
WordPiece	Similar to BPE	BERT-style models	Less common for newer decoder LLMs
Unigram LM	Chooses likely subword pieces probabilistically	Good multilingual tokenization	More complex
Byte-level BPE	Works on raw bytes	Handles any text	Sometimes less human-readable

Beginner toy model: word-level or character-level tokenization.

Real LLM: BPE, byte-level BPE, or unigram subword tokenization.

Tokenization is not a small detail. It affects vocabulary size, sequence length, multilingual behavior, memory use, and how cleanly the model handles rare words.

7. Step 4 — create training sequences

LLMs cannot read unlimited text at once. They process at most a fixed number of tokens, called the context length.

C = context length

Example:

C = 8

This means the model can look at up to 8 tokens at a time.

Suppose we have token IDs:

[3, 4, 5, 6, 3, 7, 8, 9, 10, 11, 12]

With context length C = 8:

Input:  [3, 4, 5, 6, 3, 7, 8, 9]
Target: [4, 5, 6, 3, 7, 8, 9, 10]

The target is the same sequence shifted by one position: each target token is the token that comes right after the matching input token.

Position	Input seen	Predict
1	3	4
2	3, 4	5
3	3, 4, 5	6
4	3, 4, 5, 6	3

Remember: the shifted target setup is the fundamental LLM training setup.
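The shifted setup can be sketched in a couple of lines. The helper name make_example is an assumption. It takes C + 1 tokens and returns the first C as input and the last C as target.

```python
def make_example(token_ids, start, context_length):
    """Cut one (input, target) pair out of a long token stream."""
    chunk = token_ids[start:start + context_length + 1]
    return chunk[:-1], chunk[1:]


token_ids = [3, 4, 5, 6, 3, 7, 8, 9, 10, 11, 12]
inputs, targets = make_example(token_ids, start=0, context_length=8)
print("Input: ", inputs)
print("Target:", targets)
```

At every position i, targets[i] is the token that follows inputs[i] in the original stream, so one chunk of C tokens provides C next-token predictions.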

8. Step 5 — convert token IDs into vectors

A token ID like 4 is just a label. It has no meaning by itself.

So we create an embedding matrix:

E ∈ ℝ^(V × d)

E is a table of plain numbers with V rows (one per vocabulary token) and d columns (the length of each token's vector). "∈ ℝ" just means "is made of real numbers".

E = embedding matrix
V = vocabulary size
d = embedding dimension
ℝ = real numbers
V × d = V rows and d columns

If V = 10,000 and d = 512:

E ∈ ℝ^(10,000 × 512)

This is a table with 10,000 rows and 512 columns, so every one of the 10,000 tokens gets its own list of 512 numbers.

Each token has a 512-number vector.

cat  → [0.12, -0.03, 0.77, ..., 0.08]
king → [0.55,  0.19, -0.44, ..., 0.91]

These numbers start random. During training, they become meaningful.

Words used in similar contexts tend to get similar embeddings.

If the books contain:

The cat sleeps.
The dog sleeps.
The cat eats.
The dog eats.

Then "cat" and "dog" may become close in vector space.

9. Step 6 — add position information

The model needs to know word order.

Without position, these two sentences would look too similar:

The dog chased the cat.
The cat chased the dog.

Same words, different meaning.

So we add positional information:

hᵢ⁰ = E[xᵢ] + P[i]

The vector the model actually reads at position i is the token's meaning vector plus a vector that says where in the sequence it sits.

hᵢ⁰ = initial input vector for position i (before any Transformer layers)
E[xᵢ] = token embedding for token xᵢ
P[i] = position embedding for position i
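The additive formula can be sketched with plain Python lists. The assumptions are tiny sizes (V = 9, d = 4, C = 8), random starting values, and learned absolute positions. The names random_table and embed are illustrative. Positions are counted from 0 in the code, while the slides count from 1.

```python
import random


def random_table(rows, cols, rng):
    """A rows x cols table of small random numbers (untrained weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed(token_ids, E, P):
    """For each position, add the token's row of E to the position's row of P."""
    return [
        [e + p for e, p in zip(E[token], P[position])]
        for position, token in enumerate(token_ids)
    ]


rng = random.Random(0)
V, d, C = 9, 4, 8
E = random_table(V, d, rng)  # token embeddings
P = random_table(C, d, rng)  # position embeddings
h0 = embed([3, 4, 5, 6, 3, 7, 8], E, P)
print(len(h0), "positions,", len(h0[0]), "numbers each")
```

Token 3 appears at positions 0 and 4. It reads the same row of E both times but gets a different row of P, so the two input vectors differ.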

Position method	What it does	Best for
Learned absolute positions	Learn a vector for position 1, 2, 3, etc.	Simple, works well
Sinusoidal positions	Fixed sine/cosine pattern	Original Transformer; no learned position table
RoPE	Rotates vectors by position	Very common in modern LLMs; attention scores depend on relative distance
ALiBi	Adds distance-based attention bias	Efficient long-context extrapolation
Relative position bias	Learns relation between positions	Useful when relative distance matters

The additive formula above applies to learned and sinusoidal positions. RoPE rotates the vectors instead, and ALiBi adds a distance bias inside the attention scores — neither uses the addition above.

Remember: position is what makes "dog chased cat" different from "cat chased dog."

10. Step 7 — the Transformer block

A modern LLM is usually a stack of many Transformer blocks.

Example small model:

Vocabulary size V = 10,000
Context length C = 128
Embedding dimension d = 256
Layers L = 6
Attention heads H = 8

A big real model may have thousands-wide embeddings, dozens of layers, many heads, and billions of parameters. The logic is the same.

Each Transformer block has:

Normalization
Self-attention
Residual connection
Normalization
Feedforward network
Residual connection

Simple pre-norm block (meaning we normalize before each sub-layer, not after — this makes training more stable):

X′ = X + Attention(Norm(X))

Tidy up the input (Norm), let the tokens share information (Attention), then add that result back onto the original X. Adding it back is the "residual" shortcut that keeps training stable.

Y = X′ + MLP(Norm(X′))

Tidy up again, push each token through a small neural network (MLP), and add that back on to get the block's final output Y.

X = input token vectors
X′ = after attention
Y = output of block
Norm = normalization
MLP = feedforward neural network
+ = residual connection

Each block lets tokens exchange information through attention, then processes each token individually through an MLP.

11. Step 8 — self-attention

Self-attention answers:

For each token, which previous tokens should I pay attention to?

Example:

The king gave the queen his crown because he trusted her.

When processing "he," the model should attend strongly to "king." When processing "her," it should attend strongly to "queen."

Vector	Plain meaning
Query Q	What am I looking for?
Key K	What information do I contain?
Value V	What content should I pass along?

For input matrix X:

Q = XW_Q

Queries: multiply the token vectors X by a learned matrix to get "what each token is looking for".

K = XW_K

Keys: a second learned matrix gives "what each token has to offer".

V = XW_V

Values: a third learned matrix gives "the content each token will pass along" if attended to.

X = token vectors
W_Q, W_K, W_V = learned weight matrices
Q, K, V = query, key, value matrices (in this section V is the value matrix, not the vocabulary size)

The attention formula and causal mask

Attention(Q,K,V) = softmax((QKᵀ / √dₖ) + M)V

Compare every query with every key to score how relevant tokens are to each other (QKᵀ), shrink those scores so they stay numerically stable (÷ √dₖ), block out future tokens with the mask M, turn the scores into percentages with softmax, then blend the Values together using those percentages.

Q = queries
K = keys
V = values
Kᵀ = transposed keys
QKᵀ = similarity scores between tokens
dₖ = dimension of each key/query vector
√dₖ = scaling factor to keep values numerically stable
M = mask
softmax = converts scores into probabilities
Final result = weighted average of value vectors

Causal mask

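The mask M is crucial. It holds 0 where attention is allowed and −∞ where it is not, so softmax gives blocked positions a weight of exactly zero. For LLMs, token 5 can look at tokens 1–5, but not token 6 or 7.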

Why? During training, the model must not cheat by seeing the future.

Sentence: The cat sat on the mat.
When predicting "mat", the model may see:
The cat sat on the

But it cannot see:
mat.

Remember: causal attention means looking left, not right.
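Causal attention for a single head can be sketched in plain Python. Assumptions: Q, K and V are small lists of lists that have already been computed, and future positions are skipped outright. Skipping them is equivalent to adding −∞ through the mask M, because softmax would give those positions zero weight anyway. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention where position i only sees positions 0..i."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, query in enumerate(Q):
        # Scaled dot-product scores against the current and earlier keys only.
        scores = [
            sum(q * k for q, k in zip(query, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # Pad with zeros so every row has one weight per position.
        weights.append(w + [0.0] * (len(Q) - i - 1))
        # Weighted average of the visible value vectors.
        outputs.append([
            sum(w[j] * V[j][c] for j in range(i + 1))
            for c in range(len(V[0]))
        ])
    return outputs, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower-triangular: the first token can attend only to itself, and no row puts any weight on a later position.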

12. Step 9 — softmax

The model eventually produces raw scores called logits.

Token	Logit
mat	4.2
chair	1.1
river	-0.5
king	-2.0

Logits are not probabilities yet. Softmax converts logits into probabilities:

pᵢ = e^(zᵢ) / Σⱼ₌₁^V e^(zⱼ)

Each token's probability is e raised to its score, divided by the sum of e-raised-to-the-score over every token. This makes all the values positive and guarantees they add up to 1.

pᵢ = probability of token i
zᵢ = logit score for token i
e^(zᵢ) = exponential of the score
V = vocabulary size
denominator = sum of exponentials for all possible tokens

Softmax makes all probabilities add up to 1.

Token	Probability
mat	0.72
chair	0.032
river	0.0065
king	0.0015

These are 4 tokens out of a vocabulary of thousands. The remaining probability mass is spread across all other tokens — the full distribution always sums to 1.
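Softmax itself is only a few lines. The sketch below applies it to the four logits from the table, under the simplifying assumption that these four tokens are the whole vocabulary. That is why the numbers come out larger than in a real model, where thousands of other tokens share the probability mass.

```python
import math


def softmax(logits):
    """Exponentiate each logit and divide by the total."""
    top = max(logits)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["mat", "chair", "river", "king"]
logits = [4.2, 1.1, -0.5, -2.0]
probs = softmax(logits)
for token, p in zip(tokens, probs):
    print(f"{token}\t{p:.4f}")
```

The ratio between any two probabilities depends only on the difference of their logits: "mat" is e^(4.2 − 1.1) ≈ 22 times as likely as "chair", no matter how large the vocabulary is.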

13. Step 10 — the loss function

The most common pretraining loss is cross-entropy loss.

For one prediction:

𝓛 = −log P_θ(y | x_<t)

The loss for one guess is the negative log of the probability the model gave the correct token. Confident and right gives a tiny loss; confident and wrong gives a huge loss.

𝓛 = loss
y = correct next token
x_<t = all previous tokens before position t
P_θ(y | x_<t) = model's probability for the correct next token
−log = penalty for being uncertain or wrong

If the correct next token is "mat":

Model probability for "mat"	Loss (−ln p)
0.90	low (≈ 0.11)
0.50	medium (≈ 0.69)
0.01	very high (≈ 4.61)

For a batch of many tokens:

𝓛 = −(1/N) Σᵢ₌₁^N log P_θ(yᵢ | x_<i)

For a whole batch, take that same loss and average it over all N predicted tokens.

N = number of predicted tokens in the batch
yᵢ = correct token for example i

Remember: the model is rewarded for assigning high probability to the correct next token.
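Both formulas on this slide fit in a few lines. A minimal sketch using the natural logarithm. The function names token_loss and batch_loss are illustrative.

```python
import math


def token_loss(p_correct):
    """Cross-entropy for one prediction: -log of the correct token's probability."""
    return -math.log(p_correct)


def batch_loss(p_correct_list):
    """Average the per-token loss over all N predictions."""
    return sum(token_loss(p) for p in p_correct_list) / len(p_correct_list)


for p in (0.90, 0.50, 0.01):
    print(f"p = {p:.2f}  loss = {token_loss(p):.3f}")

print("batch loss:", round(batch_loss([0.90, 0.50, 0.01]), 3))
```

The loss is 0 only when the model puts probability 1 on the correct token, and it grows without bound as that probability approaches 0.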

14. Step 11 — backpropagation

The model made a prediction. The loss says how bad it was.

Now we ask:

Which internal weights caused the error, and how should we change them?

This is done with backpropagation.

∇_θ 𝓛

The gradient of the loss: a list with one entry per weight, saying how quickly the loss would rise if that weight were increased. Nudging each weight a little in the opposite direction reduces the error. (The ∇ symbol means "gradient".)

∇_θ = gradient with respect to parameters θ
𝓛 = loss

The gradient tells each parameter whether to go up or down, and by how much, to reduce the loss.

A simple update rule is gradient descent:

θ_new = θ_old − η ∇_θ 𝓛

New weights = old weights minus a small step in the direction that lowers the loss. η (eta) controls how big that step is.

θ_old = current weights
θ_new = updated weights
η = learning rate
∇_θ 𝓛 = gradient of the loss

Learning rate η	Effect
Too small	Training is slow
Too large	Training becomes unstable
Reasonable	Model improves steadily
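The three rows of the table can be seen on a one-parameter toy problem. The assumptions are a made-up loss (θ − 3)², whose gradient is 2(θ − 3), and a starting point of θ = 0. A real model has millions of parameters, but the update rule is the same.

```python
def gradient_descent(learning_rate, steps=50, theta=0.0):
    """Minimize the toy loss (theta - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)            # gradient of the loss at theta
        theta = theta - learning_rate * grad  # theta_new = theta_old - eta * grad
    return theta


for eta in (0.001, 0.1, 1.1):
    print(f"eta = {eta}: theta after 50 steps = {gradient_descent(eta):.4f}")
```

With η = 0.001 the parameter has barely moved toward 3 after 50 steps, with η = 0.1 it is essentially at 3, and with η = 1.1 every step overshoots further and the value blows up.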

15. Step 12 — optimizer

In practice, LLMs usually use AdamW or Adam-like optimizers, not plain gradient descent.

Simplified AdamW:

g_t = ∇_θ 𝓛_t

g is simply the current gradient at training step t, the raw "which way to nudge" signal.

m_t = β₁m_{t−1} + (1−β₁)g_t

m is a running average of recent gradients: the smoothed "which direction are we generally heading", so one noisy step can't throw things off.

v_t = β₂v_{t−1} + (1−β₂)g_t²

v is a running average of recent squared gradients: "how big have the steps been lately" for each weight.

θ_{t+1} = θ_t − η · m_t/(√v_t + ε) − ηλθ_t

Step each weight in the smoothed direction m, but scale the step down where gradients have been large (÷ √v). The final term gently shrinks weights toward zero (weight decay). ε is a tiny number that prevents dividing by zero.

g_t = gradient at step t
m_t = moving average of gradients
v_t = moving average of squared gradients
β₁ = how much past gradient direction to remember
β₂ = how much past gradient magnitude to remember
η = learning rate
ε = tiny number to avoid division by zero
λ = weight decay strength
θ_t = parameters at step t

AdamW remembers both the average direction and average size of recent gradients, so each parameter gets a smarter step size.

Real implementations also apply bias correction: dividing m by (1 − β₁ᵗ) and v by (1 − β₂ᵗ), to account for the zero-initialized moving averages in early training steps.
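The update equations, including the bias correction just described, can be sketched for a single scalar weight. The default values below are common choices but are assumptions here, and adamw_step is an illustrative name, not a library function. Real optimizers apply the same arithmetic to every weight at once.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for one weight. t is the step number, starting at 1."""
    m = beta1 * m + (1 - beta1) * grad         # running average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for zero-initialized m
    v_hat = v / (1 - beta2 ** t)               # bias correction for zero-initialized v
    theta = (theta
             - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive gradient step
             - lr * weight_decay * theta)             # decoupled weight decay
    return theta, m, v


# First step on a weight at 0 with gradient -6 and no weight decay.
theta, m, v = adamw_step(0.0, grad=-6.0, m=0.0, v=0.0, t=1, lr=0.1, weight_decay=0.0)
print(theta)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by about one learning rate regardless of how large the gradient is.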

Optimizer options

Optimizer	Best for	Note
SGD	Simple theory	Usually slower for LLMs
SGD + momentum	More stable than SGD	Still less common for LLMs
Adam	Fast, adaptive	Very common historically
AdamW	Standard for many Transformer models	Adam with decoupled weight decay
Adafactor	Memory efficient	Useful for very large models
Lion	Fast, lower memory in some cases	Less universally standard
Sophia / second-order variants	Potentially faster convergence	More complex, less universal

For beginner understanding: gradient descent.

For real LLM training: AdamW or memory-efficient variants.

The optimizer does not define intelligence by itself. It is the mechanism that changes weights based on the loss signal.

16. Step 13 — the training loop

The training loop repeats the same basic cycle many times:

Take a batch of token sequences
Feed them into the model
Model predicts next-token probabilities
Compare predictions with true next tokens
Compute loss
Backpropagate gradients
Optimizer updates weights
Repeat

Pseudocode:

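for step in range(num_training_steps):
    # Sample a batch; the targets are the inputs shifted by one token
    input_tokens, target_tokens = get_batch()

    # Forward pass: one row of vocabulary logits per position
    logits = model(input_tokens)

    # Average next-token cross-entropy over the whole batch
    loss = cross_entropy(logits, target_tokens)

    # Backpropagation: compute the gradient of the loss for every weight
    loss.backward()

    # Optimizer update: change the weights using those gradients
    optimizer.step()

    # Clear the gradients so they do not accumulate into the next step
    optimizer.zero_grad()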

In the 100-book universe, one batch might contain 32 chunks from different books.

Batch size = 32
Context length = 128
Predictions per batch = 32 × 128 = 4096 next-token predictions

Remember: every step teaches the model from thousands of next-token guesses.

17. Step 14 — example of one tiny training step

Suppose vocabulary:

["The", "cat", "dog", "sat", "slept", "."]

Input:

The cat

Correct next token:

sat

The untrained model predicts:

Token	Probability
The	0.10
cat	0.20
dog	0.15
sat	0.05
slept	0.25
.	0.25

The correct token is "sat," but the model gave it only 0.05 probability.

loss = −log(0.05) ≈ 3.0

The model gave the right token only a 5% chance, so the loss is about 3.0 (using the natural log): a big penalty that pushes that probability up next time.

High loss. Backprop updates the weights so that next time, with input "The cat," the probability of "sat" goes up.
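This single step can be worked through numerically. The sketch makes two simplifying assumptions: the update is applied directly to the six logits rather than to model weights, and the learning rate is 1.0. For softmax followed by cross-entropy, the gradient with respect to each logit is the predicted probability minus 1 for the correct token, or minus 0 for the others. Real backpropagation carries that same signal further back into the weights.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["The", "cat", "dog", "sat", "slept", "."]
probs_before = [0.10, 0.20, 0.15, 0.05, 0.25, 0.25]
target = vocab.index("sat")

# Logits that reproduce the untrained probabilities from the table.
logits = [math.log(p) for p in probs_before]
loss_before = -math.log(softmax(logits)[target])

# Gradient of cross-entropy with respect to the logits: p - one_hot(target).
grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(softmax(logits))]

learning_rate = 1.0  # assumed value, chosen to make the change easy to see
logits = [z - learning_rate * g for z, g in zip(logits, grad)]

probs_after = softmax(logits)
loss_after = -math.log(probs_after[target])
print(f"P(sat): 0.05 -> {probs_after[target]:.3f}")
print(f"loss:   {loss_before:.3f} -> {loss_after:.3f}")
```

After one update the probability of "sat" rises from 0.05 to roughly 0.14, and every other token's probability is pushed down slightly to make room.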

After many examples:

The cat sat.
The dog slept.
The cat slept.
The dog sat.

The model learns patterns like animals can sit, animals can sleep, "The cat" is often followed by a verb, and sentences often end with "."

18. Step 15 — what the model is actually learning

The model is not storing a clean database like this:

cat = animal
king = ruler
river = water

Instead, it learns distributed patterns in weights.

It learns things like:

which words appear together
grammar
style
facts
character relationships
genre patterns
cause/effect patterns
analogy-like structures
common reasoning patterns, if the books contain them

In the 100-book universe, if 20 books say:

The moon rises at night.

the model may learn that moon is related to night.

If one book says:

The moon is made of glass.

the model may also learn that false fact if that is part of the universe.

Important: LLMs learn from the data distribution. They do not automatically know truth.

19. Model architecture in detail

A basic decoder-only Transformer LLM looks like this:

Token IDs
   ↓
Token embeddings
   ↓
Position information
   ↓
Transformer block 1
   ↓
Transformer block 2
   ↓
...
   ↓
Transformer block L
   ↓
Final normalization
   ↓
Output projection
   ↓
Logits over vocabulary
   ↓
Softmax probabilities

Embedding layer

Input: [x₁, x₂, ..., x_C]

The input is a row of C token IDs (one context window's worth of text).

Output: X ∈ ℝ^(C × d)

The output is a table with C rows (one per token) and d columns (each token's vector of numbers).

C = context length
d = embedding dimension
X = matrix of token vectors

Attention layer

Q = XW_Q

Queries: multiply the token vectors by a learned matrix to get "what each token is looking for".

K = XW_K

Keys: a second matrix gives "what each token has to offer".

V = XW_V

Values: a third matrix gives "the content each token passes along".

Then attention mixes information across previous tokens.

Multi-head attention

Instead of one attention operation, we use multiple heads.

If:

d = 512

H = 8

then each head may use:

d_h = 64

Each head works in a smaller 64-number slice: the 512 dimensions are split evenly across the 8 heads (512 ÷ 8 = 64).

H = number of heads
d_h = dimension per head

Each head can specialize.

Head	Possible specialization
Head 1	subject-verb relation
Head 2	punctuation
Head 3	names
Head 4	long-distance reference
Head 5	chapter style

This is not manually assigned. It emerges through training.

Why multiple heads?

Different relationship types can be tracked in parallel: grammar, reference, position, topic, formatting, and style.

MLP, residuals, and normalization

Feedforward network / MLP

After attention, each token goes through an MLP.

MLP(x) = W₂ σ(W₁x + b₁) + b₂

Multiply the token vector by W₁ and add a bias, bend it with a nonlinear function σ, then multiply by W₂ and add another bias. That "bend" is what lets the network learn patterns that aren't just straight lines.

x = token vector
W₁, W₂ = learned weight matrices
b₁, b₂ = biases
σ = activation function

Activation	What it does
ReLU	If the number is negative, make it zero; if positive, keep it. Simplest activation.
GELU	Like ReLU but smoothly curves near zero instead of a hard cutoff. Common in older Transformers.
SiLU / Swish	Another smooth curve that gently bends negatives instead of zeroing them.
SwiGLU	Uses a "gate" that learns which information to let through. Very common in modern LLMs.
GeGLU	Similar gated variant using GELU instead of SiLU.

SwiGLU and GeGLU use a gated variant with three weight matrices instead of two. The formula above shows the standard two-matrix MLP; gated versions multiply the output of one branch by a gate from another.

The attention layer lets tokens talk to each other. The MLP transforms each token's internal representation.

Residual connections

output = x + f(x)

The layer's output is its input x plus whatever change f(x) it computed, so each layer only has to learn a small adjustment instead of a full transformation.

Normalization

Method	Best for
LayerNorm	Original/common Transformer normalization
RMSNorm	Common in modern LLMs; simpler/faster
BatchNorm	Common in vision, not usually LLMs

20. Output layer

After all Transformer blocks, each position has a final hidden vector:

hᵢ ∈ ℝ^d

hᵢ is the final d-number vector the model has built up for the token at position i.

To predict the next token, convert the hidden vector into vocabulary logits:

zᵢ = hᵢ W_out + b

Multiply that final vector by the output matrix (plus a bias) to produce one raw score, a logit, for every token in the vocabulary.

hᵢ = final vector at position i
W_out = output projection matrix
b = bias
zᵢ = logits over vocabulary

If vocabulary size is V = 10,000, then zᵢ has 10,000 numbers.

Then softmax turns those 10,000 logits into probabilities.

Weight tying: some LLMs, especially smaller ones, reuse the token embedding matrix E (transposed, so W_out = Eᵀ) as the output projection. This saves V × d parameters, and the input and output representations then share the same vector space.

Remember: the output layer maps internal meaning back into possible next tokens.

21. Evaluation

During training, monitor different losses.

Metric	Meaning
Training loss	Loss on books used to update the model
Validation loss	Loss on held-out books used during development
Test loss	Final score on books not used during training decisions

If training loss goes down, the model is learning the training books.

If validation loss goes down, the model is generalizing.

If validation loss gets worse while training loss improves, the model is overfitting.

Perplexity

Perplexity = e^𝓛

Perplexity is e raised to the average loss. Intuitively, it is roughly how many equally likely next-token choices the model is hesitating between at each step. Lower is better.

𝓛 = average cross-entropy loss
e = Euler's number

Loss	Perplexity	Meaning
0.69	2.0	Model is choosing between about 2 likely tokens
1.61	5.0	About 5 likely choices
2.30	10.0	About 10 likely choices
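The table rows follow directly from the formula. A minimal sketch, assuming the loss uses the natural logarithm (the function name perplexity is illustrative):

```python
import math


def perplexity(average_loss):
    """Perplexity is e raised to the average cross-entropy loss."""
    return math.exp(average_loss)


for loss in (0.69, 1.61, 2.30):
    print(f"loss = {loss:.2f}  perplexity = {perplexity(loss):.1f}")

# A model that spreads its probability evenly over 10 tokens has loss ln(10).
uniform_loss = -math.log(1 / 10)
print(perplexity(uniform_loss))
```

This is where the "equally likely options" reading comes from: a model that is uniformly unsure among k tokens has a perplexity of k.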

22. Overfitting in the 100-book universe

Because we only have 100 books, overfitting is a major risk.

A large model could memorize everything.

Signal	Meaning
Training loss keeps falling	Model is learning training books
Validation loss stops improving	Generalization is not improving
Validation loss rises	Model is memorizing
Generated text copies books exactly	Memorization
Model performs badly on held-out books	Weak generalization

Fixes

Use smaller model
Use dropout
Use weight decay
Stop training earlier
Use more data if allowed
Use data augmentation if valid
Keep clean validation/test splits

In this universe, "more data" does not exist. The closest substitute is synthetic text generated by another model. Modern training pipelines use synthetic data extensively, but it requires careful filtering, because low-quality synthetic data can amplify errors.

23. Choosing model size for 100 books

For only 100 books, you should not train a giant model.

A reasonable toy setup:

Vocabulary size: 5,000–20,000
Context length: 128–512
Embedding dimension: 128–512
Layers: 4–8
Attention heads: 4–8
Parameters: maybe 5M–50M

If the books are short, even 50M parameters may be too large.

Model too small	Model too large
Underfits	Overfits
Cannot learn complex patterns	Memorizes
High training and validation loss	Low training loss, bad validation loss
Cheap	Expensive

For a real internet-scale LLM, billions of parameters can make sense because there are trillions of tokens. For 100 books, a smaller model is more appropriate.
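A rough parameter count helps check whether a configuration lands in the intended range. The estimate below rests on several assumptions that the slides do not fix: a learned position table, four d × d attention matrices per layer (W_Q, W_K, W_V plus an output projection), an MLP hidden size of 4 × d, an output layer tied to the embeddings, and biases and normalization weights ignored. The function name approx_params is illustrative.

```python
def approx_params(V, d, n_layers, C, mlp_ratio=4, tied_output=True):
    """Rough parameter count for a small decoder-only Transformer."""
    embeddings = V * d + C * d      # token table + learned position table
    attention = 4 * d * d           # W_Q, W_K, W_V and an output projection
    mlp = 2 * mlp_ratio * d * d     # W1 (d -> 4d) and W2 (4d -> d)
    total = embeddings + n_layers * (attention + mlp)
    if not tied_output:
        total += V * d              # separate output projection W_out
    return total


# The small example model from the Transformer block slide.
small = approx_params(V=10_000, d=256, n_layers=6, C=128)
print(f"{small:,} parameters")
```

Under these assumptions the small example model has about 7.3 million parameters, which sits near the low end of the 5M–50M range above.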

24. Training schedule

A real training run also needs schedule choices.

Batch size

How many sequences per step.

batch_size = 32
context_length = 256
tokens_per_step = 8192

Learning rate

How big each optimizer update is.

learning_rate = 3e-4

Warmup

Start with a tiny learning rate, then increase. Early training is unstable because weights are random.

Decay

After warmup, slowly reduce learning rate.

warmup → peak learning rate → cosine decay

Cosine decay lowers the learning rate along a smooth half-cosine curve: slowly at first, fastest in the middle, then leveling off near zero. This gives the model large updates early (to learn fast) and tiny updates later (to fine-tune without overshooting).
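The whole schedule can be written as one function of the step number. This is a sketch under assumptions: linear warmup, 100 warmup steps, 1,000 total steps, and decay all the way to zero. Only the peak value 3e-4 comes from the slide. The name lr_at is illustrative.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup: climb linearly from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step):.2e}")
```

Halfway through the decay phase (step 550 here) the learning rate is exactly half of the peak.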

Epoch

One epoch means the model has seen the whole training dataset once.

With 100 books, you may train for multiple epochs. With huge real datasets, LLMs are often trained for about one epoch, with little repetition of identical data.

Precision

Training with full 32-bit floating point is slow and memory-heavy. Most LLMs train in mixed precision.

Format	Bits	Use
FP32	32	Baseline; sometimes used for critical accumulations
FP16	16	Faster; needs loss scaling to avoid underflow (numbers too small to represent become zero)
BF16	16	Same range as FP32 but less precision; most common for LLM training

BF16 is the most common training precision for modern LLMs. It keeps the same exponent range as FP32, so gradients rarely overflow or underflow.

25. Training at scale: distributed computing

Training a model with billions of parameters on trillions of tokens does not fit on one GPU. Training is split across many machines.

Parallelism strategies

Strategy	What it splits	Best for
Data parallelism	Each GPU gets different data batches, same model copy	Small-to-medium models
Tensor parallelism	Splits individual weight matrices across GPUs	Very large layers
Pipeline parallelism	Different layers on different GPUs	Deep models
FSDP / ZeRO	Splits (shards) optimizer states, gradients, and parameters across GPUs — each GPU holds only a piece	Memory-efficient training

Most large training runs combine multiple strategies. For example: data parallelism across nodes, tensor parallelism within each node.

In the 100-book universe, one GPU is enough. Distributed training becomes necessary when the data and model outgrow a single machine.

26. Algorithm alternatives by stage

Model architecture options

Architecture	What it is	Best for	Weakness
N-gram model	Counts previous word patterns	Easiest baseline	No deep meaning
RNN	Processes tokens sequentially	Historical sequence modeling	Slow, weak long context
LSTM/GRU	Better RNNs with memory gates	Small sequence tasks	Harder to scale
CNN language model	Uses convolutions over text	Fast local pattern learning	Weaker long-range modeling
Encoder-only Transformer	BERT-style masked understanding	Classification, embeddings	Not ideal for open-ended generation
Encoder-decoder Transformer	Input-to-output generation	Translation, summarization	More complex
Decoder-only Transformer	Predicts next token autoregressively	General LLM/chat/code	Standard choice, but attention can be costly
Sparse MoE Transformer	Routes tokens to expert subnetworks	More capacity per compute	Routing/training complexity
State-space models	Mamba-like sequence models	Long sequences, efficient inference	Less universally dominant than Transformers
Hybrid models	Transformer + SSM/MoE	Efficiency + quality tradeoff	Complex

Training objective options

Objective	Formula idea	Best for
Causal LM	Predict next token	Chat, generation, code
Masked LM	Predict hidden tokens	Understanding, embeddings
Span corruption	Reconstruct missing spans	Seq2seq models
Prefix LM	Some tokens bidirectional, some causal	Flexible generation
Multi-token prediction	Predict multiple future tokens	Potential training efficiency/performance
Contrastive learning	Pull similar texts together	Embeddings/retrieval

Attention variants and fine-tuning options

Attention variants

Attention type	Best for
Full attention	Best simple default; every token attends to prior tokens
Sliding-window attention	Long context with lower cost
Sparse attention	Efficient long documents
FlashAttention	Faster exact attention implementation
Multi-query attention	Faster inference
Grouped-query attention	Balance quality and inference speed
Multi-head latent attention	Compresses key/value cache; efficiency-oriented

Fine-tuning options

Method	Best for
Full fine-tuning	Highest control, enough compute
LoRA	Cheap adaptation; fewer trainable parameters
QLoRA	Even cheaper, quantized base model
Adapter tuning	Modular task adapters
Prefix tuning	Small prompt-like trainable vectors
Prompt tuning	Very parameter-efficient

LoRA idea: freeze the pretrained model and inject small trainable low-rank matrices instead of updating everything.
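A minimal sketch of the LoRA idea in plain Python, assuming the usual formulation in which the effective weight is W + (alpha / r) · B · A. The matrix sizes, the values in `a`, and the helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the low rank."""
    r = len(a)                      # A has shape (r, d_in)
    delta = matmul(b, a)            # B has shape (d_out, r), so B @ A is (d_out, d_in)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d_out, d_in, r = 4, 4, 1
w = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen weights
a = [[0.1, 0.2, 0.3, 0.4]]              # trainable, shape (r, d_in)
b = [[0.0] for _ in range(d_out)]       # trainable, shape (d_out, r), starts at zero

w_eff = lora_effective_weight(w, a, b, alpha=1.0)

full_params = d_out * d_in              # numbers updated by full fine-tuning
lora_params = r * d_in + d_out * r      # numbers updated by LoRA
print(full_params, lora_params)
```

Because B starts at zero, the adapted model initially behaves exactly like the frozen one. At realistic sizes the saving is much larger: a 4096 × 4096 matrix has 16,777,216 entries, while rank-8 factors for it have 65,536.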

LLM training from scratch · Slide 27 of 37

27. After pretraining: instruction tuning

After next-token pretraining, the model knows the books but may not follow instructions well.

Stage	Teaches
Pretraining	Continue text
Instruction tuning	Answer the user, follow format, summarize, explain, refuse unsafe requests, write code

Create examples like:

Instruction:
Summarize Book 7 in three sentences.

Response:
Book 7 describes...

Then train the model on:

(instruction, ideal answer)

Each training example is simply a pair: an instruction, and the answer we want the model to give back.

The loss is still next-token prediction, but now the data format is conversational.

User: Explain why the king left the city.
Assistant: The king left because...

The model learns the assistant style.
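Here is a minimal sketch of how one (instruction, ideal answer) pair might be turned into a training sequence. Whitespace splitting stands in for a real tokenizer, and computing the loss only on the assistant's tokens is a common choice that this deck does not state, so treat both as assumptions. `build_example` is a hypothetical helper.

```python
def build_example(instruction, response):
    """Format one (instruction, ideal answer) pair and mark which tokens are trained on."""
    prompt_tokens = ("User: " + instruction + "\nAssistant:").split()
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    # 0 = ignore in the loss (prompt), 1 = include in the loss (assistant reply).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_example(
    "Explain why the king left the city.",
    "The king left because...",
)
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

The objective is unchanged: predict each next token. The mask only decides which positions contribute to the loss.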

Remember: instruction tuning does not replace pretraining. It teaches the pretrained model how to behave as an assistant.

LLM training from scratch · Slide 28 of 37

28. Preference alignment

Instruction tuning teaches the model to imitate examples.

Preference alignment teaches:

Among several possible answers, which one do humans prefer?

Example prompt:

Explain gravity.

Answer	Text
A	Gravity is a force that attracts masses...
B	Gravity is magic dust from the sky.

Human label:

A is better than B.

RLHF

RLHF means reinforcement learning from human feedback. Reinforcement learning is a training approach where the model is rewarded for good outputs and penalized for bad ones — like training a pet with treats.

Train base model
Supervised fine-tune on good answers
Collect human preference rankings
Train a reward model
Use reinforcement learning (typically PPO — Proximal Policy Optimization) to optimize the assistant model toward high reward
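The reward-model step can be sketched with the pairwise loss commonly used for it: minus the log-sigmoid of the score gap between the preferred and the rejected answer. The scores below are hypothetical numbers, not outputs of a real model.

```python
import math


def pairwise_reward_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style loss: -log sigmoid(r_preferred - r_rejected)."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# Hypothetical scalar scores a reward model might assign to answers A and B.
loss_good_ranking = pairwise_reward_loss(2.0, -1.0)   # A scored above B: small loss
loss_bad_ranking = pairwise_reward_loss(-1.0, 2.0)    # A scored below B: large loss
loss_tie = pairwise_reward_loss(0.0, 0.0)             # no preference: log(2)
print(loss_good_ranking, loss_tie, loss_bad_ranking)
```

Minimizing this loss pushes the reward model to score the human-preferred answer higher than the rejected one.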

DPO

DPO means Direct Preference Optimization.

It skips the separate reward-model-plus-RL loop and directly trains on preferred vs rejected answers.

Prompt: Explain gravity.
Chosen: Gravity is attraction between masses...
Rejected: Gravity is magic dust...

DPO pushes up probability of the chosen answer and pushes down probability of the rejected answer.
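A minimal sketch of the DPO loss for a single preference pair. The frozen reference model and the strength parameter `beta` come from the standard DPO formulation rather than from this deck, and the log-probabilities are hypothetical values.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # how much the policy raised "chosen"
    rejected_ratio = logp_rejected - ref_logp_rejected  # how much the policy raised "rejected"
    logits = beta * (chosen_ratio - rejected_ratio)
    return math.log(1.0 + math.exp(-logits))


# Hypothetical sequence log-probabilities under the policy and the reference model.
start = dpo_loss(-12.0, -12.0, -12.0, -12.0)      # policy equals reference
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)   # chosen went up, rejected went down
print(start, improved)
```

The loss falls as the chosen answer becomes more likely and the rejected answer less likely, both measured relative to the reference model.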

Other alignment methods

Method	Key idea
KTO	Uses only good/bad labels instead of pairwise comparisons
ORPO	Combines instruction tuning and alignment in one step
GRPO	Scores each sampled answer relative to a group of samples for the same prompt; needs no separate value (critic) model
Constitutional AI / RLAIF	Uses AI feedback instead of human feedback to generate preferences

The field moves fast. RLHF and DPO capture the core ideas; newer methods refine the process.

LLM training from scratch · Slide 29 of 37

29. Inference: how the trained model writes text

After training, generation works one token at a time.

Prompt:

The cat sat on the

Model predicts probabilities:

Token	Probability
mat	0.72
chair	0.08
floor	0.05
roof	0.02

Choose one token:

mat

Now prompt becomes:

The cat sat on the mat

Predict next token again, choose another token, and repeat.
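The loop above can be sketched with a toy lookup table standing in for the model. The first row reuses the probabilities from the table above. The second row and the stop rule are hypothetical additions for illustration.

```python
# Hypothetical toy "model": maps the last token to next-token probabilities.
NEXT_TOKEN_PROBS = {
    "the": {"mat": 0.72, "chair": 0.08, "floor": 0.05, "roof": 0.02},
    "mat": {".": 0.90, "and": 0.10},
}


def generate_greedy(prompt_tokens, max_new_tokens=5, stop_token="."):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1])
        if probs is None:          # the toy table has no entry for this context
            break
        next_token = max(probs, key=probs.get)   # greedy: pick the most probable token
        tokens.append(next_token)
        if next_token == stop_token:
            break
    return tokens


output = generate_greedy(["The", "cat", "sat", "on", "the"])
print(" ".join(output))
```

A real model conditions on the whole prompt, not just the last token. The loop structure (predict, choose, append, repeat) is the same.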

KV cache

Generating token 50 requires attending to tokens 1–49. Without optimization, the model would recompute key and value vectors for all previous tokens every step.

The KV cache stores key and value vectors from previous tokens so they are computed only once. Each new token only computes its own Q, K, V and attends to the cached keys and values.

Without KV cache	With KV cache
Recompute all K, V every step	Compute K, V once per token, reuse from cache
Per-step cost grows quadratically with context length	Per-step cost grows linearly with context length
Very slow	Standard practice

Virtually every production LLM serving stack uses a KV cache during inference.
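A small counting sketch shows how much key/value work the cache saves. It counts one unit each time a token's K and V vectors are computed, for a single layer. The function names are illustrative.

```python
def kv_projections_without_cache(num_tokens):
    """Each step recomputes K and V for every token seen so far."""
    return sum(step for step in range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens):
    """Each token's K and V are computed once and then reused."""
    return num_tokens


n = 50
without_cache = kv_projections_without_cache(n)   # 1 + 2 + ... + 50
with_cache = kv_projections_with_cache(n)
print(without_cache, with_cache)
```

The price is memory: the cache holds one key vector and one value vector per token, per layer, which is why variants such as multi-query and grouped-query attention aim to shrink it.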

Decoding options

Method	What it does	Best for
Greedy decoding	Always choose highest probability token	Deterministic, simple
Beam search	Track several likely continuations	Translation, structured output
Sampling	Randomly sample from probabilities	Creative text
Temperature	Controls randomness	Lower = safer, higher = creative
Top-k	Sample only from top k tokens	Avoids weird low-probability tokens
Top-p / nucleus	Sample from smallest set covering p probability	Common creative decoding
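A minimal sketch of top-k and top-p filtering, using the four probabilities from the earlier table. The remaining probability mass is ignored here for simplicity, and the function names are illustrative.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


probs = {"mat": 0.72, "chair": 0.08, "floor": 0.05, "roof": 0.02}
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.75)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
sample = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(top2, nucleus, sample)
```

Top-k keeps a fixed number of candidates. Top-p keeps however many are needed to reach the probability threshold, so the candidate set shrinks when the model is confident and grows when it is not.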

Temperature formula

pᵢ = softmax(zᵢ / τ)

Divide every score by the temperature τ before softmax. A small τ makes the top choice dominate (safe, predictable); a large τ flattens the odds (more random, more creative).

τ = temperature
lower τ = sharper, more predictable
higher τ = flatter, more random
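The temperature formula can be tried directly. This sketch uses three hypothetical scores (logits) and subtracts the maximum before exponentiating, a standard trick for numerical stability that does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]          # hypothetical scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
print(sharp[0], plain[0], flat[0])
```

The top token's probability is highest at τ = 0.5 and lowest at τ = 2.0, matching the "sharper" versus "flatter" description above.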

LLM training from scratch · Slide 30 of 37

30. Quantization: making models smaller

A model with 70 billion FP16 parameters needs ~140 GB of memory just to store the weights. Most devices cannot handle this. Quantization reduces the number of bits per parameter.

Precision	Bits per param	70B model size	Quality
FP16 / BF16	16	~140 GB	Full quality
INT8	8	~70 GB	Minimal quality loss
INT4	4	~35 GB	Some quality loss, but usable
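The sizes in the table follow from one line of arithmetic: parameters × bits per parameter ÷ 8 bits per byte. This sketch assumes 1 GB = 10⁹ bytes and counts weights only (no activations, no KV cache).

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for weights only, using 1 GB = 10**9 bytes."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in [("FP16 / BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(name, weight_memory_gb(70e9, bits), "GB")
```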

Common methods and formats

Method	How it works
GPTQ	Post-training quantization using calibration data (a small sample of real text that helps decide how to best compress each weight)
AWQ	Protects important weights from aggressive quantization
GGUF	Format for running quantized models locally (llama.cpp)
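To show what "fewer bits per parameter" means in practice, here is a minimal sketch of the simplest scheme: symmetric round-to-nearest INT8 quantization with a single scale. This is not GPTQ or AWQ, which are more careful about which errors matter. The weights are hypothetical, and the sketch assumes at least one of them is nonzero.

```python
def quantize_int8(weights):
    """Symmetric round-to-nearest quantization with one scale for the whole list."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.0371, -0.5, 0.25, 0.1]     # hypothetical original weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, restored, max_error)
```

Each weight is stored as a small integer plus one shared scale. The rounding error per weight is at most half a scale step, which is the "small amount of accuracy" being traded away.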

Quantization is why a 70-billion-parameter model can run on a laptop with enough memory (about 35 GB for INT4 weights). It trades a small amount of accuracy for a large reduction in memory, and often faster inference.

LLM training from scratch · Slide 31 of 37

31. What "parameters" mean

When people say:

A model has 7 billion parameters.

They mean:

The neural network contains 7 billion learned numbers.

These are not facts, not words, and not rows in a database.

They are numbers inside matrices.

Examples of parameter locations

Embedding matrix
Attention query matrix
Attention key matrix
Attention value matrix
MLP matrices
Output matrix
Normalization weights

A parameter might be one number like:

0.0371

Training changes this number slightly over and over until the whole network predicts text well.

Remember: parameters are learned numbers, not explicit memories.
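A rough sketch of how a parameter count adds up from the locations listed above. It assumes a simplified decoder-only layout: four attention matrices per layer (query, key, value, plus an output projection), a two-matrix MLP with 4× expansion, two normalization vectors per layer, and no bias terms. Real models differ in these details, so treat the numbers as illustrative.

```python
def count_parameters(vocab_size, d_model, n_layers, mlp_ratio=4, tied_output=False):
    """Count learned numbers in a simplified decoder-only Transformer (no biases)."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model                  # query, key, value, output projections
    mlp = 2 * d_model * (mlp_ratio * d_model)          # up-projection and down-projection
    norms = 2 * d_model                                # two normalization scale vectors per layer
    per_layer = attention + mlp + norms
    output = 0 if tied_output else vocab_size * d_model
    final_norm = d_model
    return embedding + n_layers * per_layer + final_norm + output


tiny = count_parameters(vocab_size=6, d_model=8, n_layers=2)
print(tiny)
```

Even this toy configuration, with a 6-token vocabulary and 8-dimensional vectors, has over a thousand learned numbers. Most of them sit in the attention and MLP matrices.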

LLM training from scratch · Slide 32 of 37

32. Why scale matters

Bigger models can store and transform more patterns.

More data gives more examples.

More compute lets you train bigger models on more data.

But they must be balanced.

Imbalance	Result
Model too big for data	Memorizes / overfits
Dataset too big for model	Model underuses the data
Training run too short	Undertrained model
Training run too long on tiny data	Memorization risk

For the 100-book universe, the optimal model is probably small.

For the real internet-scale universe, the optimal model can be enormous.

Scaling laws

Research (notably the Chinchilla paper, 2022) showed that model size and dataset size should grow together.

The key finding: a model with N parameters should be trained on roughly 20×N tokens for compute-optimal training.

Model parameters	Recommended tokens
1B	~20B tokens
10B	~200B tokens
70B	~1.4T tokens
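The table is just the 20× rule applied in one direction. The same rule can be run backwards to size a model for a fixed token budget. The 10-million-token figure for the 100-book universe is a made-up assumption (roughly 100,000 tokens per book), used only to show the arithmetic.

```python
TOKENS_PER_PARAMETER = 20   # the rule of thumb quoted above


def recommended_tokens(num_params):
    """Token budget suggested by the 20x rule for a given model size."""
    return TOKENS_PER_PARAMETER * num_params


def recommended_params(num_tokens):
    """Model size suggested by the 20x rule for a given token budget."""
    return num_tokens / TOKENS_PER_PARAMETER


print(recommended_tokens(70e9))          # tokens for a 70B-parameter model
print(recommended_params(10_000_000))    # parameters for a hypothetical 10M-token corpus
```

Under that assumption the rule suggests a model of about half a million parameters, tiny by modern standards.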

In the 100-book universe, the token budget is small, so the model should also be small — matching what we said in slide 23.

Remember: scale helps only when model size, data size, and compute are balanced.

LLM training from scratch · Slide 33 of 37

33. Beyond text: multimodal models

This entire deck describes training on text. Modern LLMs also process images, audio, video, and code.

The core idea is the same: convert inputs into token-like vectors, then feed them through the same Transformer.

Modality	How it becomes tokens
Text	Tokenizer (BPE, etc.)
Images	Split into patches, each patch becomes a vector (ViT-style)
Audio	Convert to spectrogram, split into frames
Video	Sample frames, treat each frame like an image
Code	Same text tokenizer, but trained on code data

The Transformer does not care what the vectors represent. If you can convert an input into a sequence of vectors, the Transformer can process it.
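As a sketch of "split into patches, each patch becomes a vector", the code below cuts a toy 4×4 grayscale image into 2×2 patches and flattens each one. It assumes the image dimensions are divisible by the patch size. A real ViT-style model would also pass each flattened patch through a learned linear projection, which is omitted here.

```python
def image_to_patch_vectors(image, patch_size):
    """Split a 2D grayscale image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


image = [[row * 4 + col for col in range(4)] for row in range(4)]   # 4x4 toy image
patches = image_to_patch_vectors(image, patch_size=2)
for patch in patches:
    print(patch)
```

The result is a sequence of four vectors, which is the same kind of input the Transformer receives from text token embeddings.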

In our 100-book universe, the books are pure text. But the same training principles apply if we added 100 illustrated books instead.

LLM training from scratch · Slide 34 of 37

34. Complete tiny example

Suppose the entire universe has only these "books":

Book 1: The cat sat.
Book 2: The dog slept.
Book 3: The cat slept.
Book 4: The dog sat.

Vocabulary:

["The", "cat", "dog", "sat", "slept", "."]

Token IDs:

The = 0
cat = 1
dog = 2
sat = 3
slept = 4
. = 5

Training sequences

Input	Target
The	cat
The cat	sat
The cat sat	.
The	dog
The dog	slept
The dog slept	.
The cat	slept
The cat slept	.
The dog	sat
The dog sat	.

At first, the model is random.

After training:

Input	Likely next tokens
The cat	sat 0.45; slept 0.45; dog 0.02; . 0.01
The dog	sat 0.45; slept 0.45; cat 0.02; . 0.01

The model learned that "The" is followed by animal words, "cat" and "dog" are similar, "sat" and "slept" are possible verbs, and "." often ends the sentence.
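As a sanity check on these patterns, here is a count-based baseline over the four books. It is a simple bigram counter that looks only at the previous token, not a Transformer, so it is a stand-in for illustration. The period is separated with a space so that whitespace splitting yields the vocabulary above.

```python
from collections import Counter, defaultdict

books = ["The cat sat .", "The dog slept .", "The cat slept .", "The dog sat ."]

# counts[prev][nxt] = how often token `nxt` follows token `prev`.
counts = defaultdict(Counter)
for book in books:
    tokens = book.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1


def next_token_probs(prev):
    """Relative frequency of each token that followed `prev` in the books."""
    total = sum(counts[prev].values())
    return {token: n / total for token, n in counts[prev].items()}


after_the = next_token_probs("The")
after_cat = next_token_probs("cat")
print(after_the, after_cat)
```

Counting gives exactly 0.5 for "sat" and "slept" after "cat". A trained neural model lands near that split but, as in the table above, keeps a little probability for other tokens.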

Scale this from 4 tiny books to 100 books, then to trillions of tokens, and the same idea becomes an LLM.

LLM training from scratch · Slide 35 of 37

35. Final compressed summary

An LLM is trained like this:

Take all text.
Tokenize it.
Convert tokens to vectors.
Feed token vectors through many Transformer layers.
At each position, predict the next token.
Compare prediction to the real next token.
Compute cross-entropy loss.
Backpropagate the error.
Use AdamW-like optimization to update weights.
Use mixed-precision and distributed training to scale.
Repeat until validation performance stops improving.
Optionally fine-tune on instructions.
Optionally align with human preferences.
Generate text one token at a time.
Optionally quantize for efficient deployment.
Extend to images, audio, and other modalities with the same architecture.

The core training objective is:

minimize over θ: −Σᵢ log P_θ(xᵢ | x_<i)

Plain English:

Adjust the model's internal numbers so the actual next token from the books becomes more probable.

That is the heart of LLM training.
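The objective can be evaluated by hand for one short sequence. This sketch takes the probabilities the model assigned to each correct next token and sums their negative logs. The "trained" probabilities are hypothetical. The "untrained" case assumes a uniform guess over the 6-token vocabulary from the tiny example.

```python
import math


def negative_log_likelihood(probs_of_correct_tokens):
    """Sum of -log P(correct next token) over the positions of one sequence."""
    return -sum(math.log(p) for p in probs_of_correct_tokens)


# Predicting "cat", "sat", "." after "The", "The cat", "The cat sat".
untrained = negative_log_likelihood([1 / 6] * 3)        # uniform over 6 tokens
trained = negative_log_likelihood([0.5, 0.45, 0.9])     # hypothetical trained model
print(untrained, trained)
```

Training lowers this number by raising the probability assigned to the tokens that actually appear in the books.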

LLM training from scratch · Slide 36 of 37

36. From predictor to reasoner

Everything so far builds a model that predicts the next token in one quick pass. Ask it a hard math or logic problem and it often blurts out a confident — but wrong — answer.

A reasoning model is the exact same Transformer, trained to think before it answers.

The key shift: think, then answer

A normal model goes straight to the answer:

Question → Answer

A reasoning model writes a long private train of thought first:

Question → step, step, step, check, backtrack… → Answer

That middle part is a chain of thought — the model "talking to itself" on a scratchpad, working through the problem one step at a time, before committing to a final answer.

Why this makes it smarter

Remember: each token the model generates is one burst of computation. By producing hundreds of reasoning tokens before answering, the model spends far more compute on a hard problem than a single-pass answer ever could.

more thinking tokens → more computation → better answers on hard problems

Letting the model write more steps before answering gives it more chances to work things out, like a student using scratch paper instead of answering instantly.

This is called test-time compute: you make the model smarter by letting it think longer at answer-time, not only by making it bigger at training-time.

How it is trained

Start from a normal instruction-tuned model, then add a reasoning stage:

Step	What happens
Show worked examples	Train on answers that include step-by-step working, not just the final result.
Let it attempt hard problems	On questions with a checkable answer (math, code), the model generates many full solution attempts.
Reward only correct answers	Give a point when the final answer is actually right, and reinforce the whole thinking path that got there.
Repeat millions of times	The model gradually discovers which styles of thinking earn points.

reward = 1 if final answer is correct, else 0

No human grades the individual steps. The model only earns a point when it lands on the right final answer, so it has to figure out good reasoning on its own.

This is reinforcement learning on verifiable rewards: because math and code answers can be checked automatically, no human has to label the reasoning. The model effectively teaches itself.
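A minimal sketch of such an automatic checker. The "Answer:" marker is a hypothetical output format chosen for this example, and the attempts are invented. The point is that only the final answer is compared, never the steps.

```python
def extract_final_answer(completion):
    """Take the text after the last 'Answer:' marker (a hypothetical output format)."""
    marker = "Answer:"
    if marker not in completion:
        return None
    return completion.rsplit(marker, 1)[1].strip()


def reward(completion, correct_answer):
    """1 if the final answer matches the known correct answer, else 0."""
    return 1 if extract_final_answer(completion) == correct_answer else 0


attempts = [
    "17 + 25: 17 + 20 = 37, then 37 + 5 = 42. Answer: 42",
    "17 + 25 is about 40. Answer: 40",
    "Wait, 7 + 5 = 12, carry the 1, so 1 + 2 + 1 = 4. Answer: 42",
]
rewards = [reward(attempt, "42") for attempt in attempts]
print(rewards)
```

The first and third attempts earn a point even though they reason differently. The reinforcement step then makes the kinds of reasoning that led to correct answers more likely.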

What emerges — by itself

The surprising part: nobody hand-codes these behaviors. With enough practice on checkable problems, the model spontaneously starts to:

Self-correct — "wait, that step is wrong, let me redo it."
Break problems down — split a big question into smaller, easier pieces.
Explore options — try several approaches and keep the one that works.
Think longer on harder problems — spend more steps when a question is tough.

In the 100-book universe

Instead of instantly guessing the next plot point, the model first writes a scratchpad: "the gun appeared in chapter 1, the character is furious, an earlier promise was broken — so the likely next event is…" — and only then answers. We reward the chains that correctly predict held-out passages, and the model learns to reason about the story instead of pattern-matching.

The trade-off: reasoning is slower and more expensive — the model generates many hidden "thinking" tokens for every answer. The payoff is far higher accuracy on math, code, logic, and planning.

Same architecture, new skill: a reasoning model is not a different machine — it is a next-token predictor that has been rewarded for thinking out loud until it gets hard things right.

LLM training from scratch · Slide 37 of 37

37. Glossary

Every key term from the deck, in plain language.