Working through Stanford CS336 — Language Modeling from Scratch — in 33.6 days.

Documenting my progress below.

Day 4 of 33.6
Basics
Day 1 · Apr 13, 2026
Intro & Tokenization
3h
Day 2 · Apr 14, 2026
Gap — no note logged
Day 3 · Apr 15, 2026
Lecture 2
3h
Day 4 · Apr 16, 2026
Lectures 3 & 4
7h
Day 5 · Apr 17, 2026
Coming up
Day 6 · Apr 18, 2026
Coming up
Day 7 · Apr 19, 2026
Coming up
Systems
Day 8 · Apr 20, 2026
Coming up
Day 9 · Apr 21, 2026
Coming up
Day 10 · Apr 22, 2026
Coming up
Day 11 · Apr 23, 2026
Coming up
Day 12 · Apr 24, 2026
Coming up
Day 13 · Apr 25, 2026
Coming up
Day 14 · Apr 26, 2026
Coming up
Scaling Laws
Day 15 · Apr 27, 2026
Coming up
Day 16 · Apr 28, 2026
Coming up
Day 17 · Apr 29, 2026
Coming up
Day 18 · Apr 30, 2026
Coming up
Day 19 · May 1, 2026
Coming up
Day 20 · May 2, 2026
Coming up
Day 21 · May 3, 2026
Coming up
Data
Day 22 · May 4, 2026
Coming up
Day 23 · May 5, 2026
Coming up
Day 24 · May 6, 2026
Coming up
Day 25 · May 7, 2026
Coming up
Day 26 · May 8, 2026
Coming up
Day 27 · May 9, 2026
Coming up
Day 28 · May 10, 2026
Coming up
Alignment
Day 29 · May 11, 2026
Coming up
Day 30 · May 12, 2026
Coming up
Day 31 · May 13, 2026
Coming up
Day 32 · May 14, 2026
Coming up
Day 33 · May 15, 2026
Coming up
Day 34 · Sprint end
0.6-day finish line
Today · Lecture Notes

Lectures 3 & 4

7h

All Notes
Day 4 · Lecture Notes

Lectures 3 & 4

Pre-norm works better, even without warmup, and doesn't need gradient attenuation.
Optimizing the normalization operation matters because of memory movement, not just FLOPs.
Activations:
GeLU: x times the Gaussian CDF, a smooth ReLU-like activation that is differentiable everywhere.
Gated units: the inner part of the MLP is gated entry-wise, using a linear projection of the input x.
GeGLU is the gated version of GeLU.
Gated units create slightly…
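To make the GeLU and GeGLU notes concrete, here is a minimal sketch in plain Python. It assumes the exact GeLU form x * Phi(x) and a three-matrix gated MLP; the names `matvec`, `geglu_mlp`, `W_gate`, `W_up`, and `W_down` are my own placeholders, not from the lecture.

```python
import math


def gelu(x: float) -> float:
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    """ReLU for comparison: has a kink at 0, unlike GeLU."""
    return max(0.0, x)


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def geglu_mlp(x, W_gate, W_up, W_down):
    """Gated MLP sketch (hypothetical layout, no biases).

    hidden = gelu(W_gate @ x) * (W_up @ x)   # entry-wise product
    output = W_down @ hidden
    """
    gate = [gelu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


if __name__ == "__main__":
    # GeLU tracks ReLU for large |x| but is smooth around 0.
    for v in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(v, relu(v), gelu(v))
```

The gate and the "up" projection both read the same input x, which is what "gated entry-wise using x" refers to in the note above.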

7h
Day 3 · Lecture Notes

Lecture 2

Primitives: Tensors -> Models -> Optimizers -> Training Loops
Efficiency of use of resources
Days to train a model = (6 * no. of model parameters * no. of tokens) / (h100_flops_p_sec * 1024 * 60 * 60 * 24 * mfu), i.e. FLOPs needed / FLOPs per day
float32: single precision, 4 bytes; good for scientific computing, not needed for DL
Memory = size * datatype
float16 - small…
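The two back-of-the-envelope formulas in this note can be sketched as small Python helpers. This is a rough sketch, not the lecture's code: I am assuming the 1024 factor in the note is a GPU count, and all function and parameter names here are my own. The numbers in the usage section are illustrative placeholders only.

```python
SECONDS_PER_DAY = 60 * 60 * 24

# Bytes per element for the two dtypes mentioned in the note.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2}


def training_days(n_params, n_tokens, flops_per_sec_per_gpu, n_gpus, mfu):
    """Estimate training time in days.

    FLOPs needed   ~= 6 * parameters * tokens
    FLOPs per day  ~= per-GPU FLOP/s * GPUs * MFU * seconds per day
    """
    total_flops = 6 * n_params * n_tokens
    flops_per_day = flops_per_sec_per_gpu * n_gpus * mfu * SECONDS_PER_DAY
    return total_flops / flops_per_day


def tensor_bytes(shape, dtype):
    """Memory = number of elements * bytes per element."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * BYTES_PER_ELEMENT[dtype]


if __name__ == "__main__":
    # Illustrative inputs only (assumptions, not measured values).
    days = training_days(
        n_params=1e9,
        n_tokens=20e9,
        flops_per_sec_per_gpu=1e15,
        n_gpus=8,
        mfu=0.5,
    )
    print(f"estimated days: {days:.2f}")
    print("bytes for a 4x8 float32 tensor:", tensor_bytes((4, 8), "float32"))
```

Keeping the GPU count and MFU as explicit parameters makes it easy to see how each one scales the estimate.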

3h
Day 1 · Lecture Notes

Intro & Tokenization

Runnable lecture with env. variables
5 assignments, good for research engineering muscles
Prototype locally, then benchmark on clusters
AI tool use - self declared - only in chat mode for pre-req and errors
Researchers are disconnected from underlying implementation
Abstraction improves productivity but abstractions are leaky
Scope for fundamental research W…

3h