CS336 × 33.6

Day 4 Lecture Notes

Lectures 3 & 4

Pre-norm works better, even without warmup, and doesn't need gradient attenuation. Optimizing the normalization operations matters because of memory movement, not just FLOPs. Activations: GeLU is x times the Gaussian CDF, a smoother, more differentiable alternative to ReLU. Gated units gate the inner part of the MLP entrywise, using the input x. GeGLU is the gated version of GeLU. Gated units create slightly…
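To make the activation notes concrete, here is a minimal pure-Python sketch of exact GeLU and a GeGLU-style feed-forward layer. The helper names (`matvec`, `geglu_ffn`) and the weight names (`W_gate`, `W_up`, `W_down`) are illustrative assumptions, not taken from the lecture, and real implementations would use a tensor library rather than nested lists.

```python
import math


def gelu(x: float) -> float:
    """Exact GeLU: x times the standard normal CDF evaluated at x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(W, x):
    """Multiply a matrix (nested lists, shape out x in) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def geglu_ffn(x, W_gate, W_up, W_down):
    """GeGLU-style MLP sketch: W_down @ (gelu(W_gate @ x) * (W_up @ x)).

    Weight names are hypothetical; biases are omitted for brevity.
    """
    gate = [gelu(g) for g in matvec(W_gate, x)]   # nonlinear gate
    up = matvec(W_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # entrywise gating
    return matvec(W_down, hidden)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(gelu(1.0))
    print(geglu_ffn([1.0, 2.0], identity, identity, identity))
```

Because the gate and the linear branch each need their own projection, a gated MLP carries three weight matrices instead of two.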

Day 3 Lecture Notes

Lecture 2

Primitives: tensors -> models -> optimizers -> training loops. Efficiency: use of resources. Days to train a model = 6 × no. of model parameters × no. of tokens (i.e. FLOPs needed) / (h100_flops_p_sec × 1024 × 60 × 60 × 24 × mfu) (i.e. FLOPs per day). float32: single precision, 4 bytes; good for scientific computing, not needed for DL. Memory = size × bytes per element of the datatype. float16: small…
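The training-time estimate above can be sketched as a small Python helper. Reading the factor of 1024 as a GPU count is an assumption (the notes do not label it), and the function and parameter names, as well as the numbers in the usage lines, are hypothetical placeholders rather than figures from the lecture.

```python
def training_days(num_params, num_tokens, flops_per_sec_per_gpu, num_gpus, mfu):
    """Estimate training time in days using the 6 * params * tokens rule.

    num_gpus is an assumed interpretation of the 1024 factor in the notes;
    mfu is the fraction of peak FLOP/s actually achieved (0 to 1).
    """
    total_flops = 6 * num_params * num_tokens
    seconds_per_day = 60 * 60 * 24
    flops_per_day = flops_per_sec_per_gpu * num_gpus * mfu * seconds_per_day
    return total_flops / flops_per_day


def tensor_bytes(num_elements, bytes_per_element):
    """Memory for a tensor: element count times bytes per element."""
    return num_elements * bytes_per_element


if __name__ == "__main__":
    # Placeholder inputs chosen only to exercise the formula.
    print(training_days(1e9, 2e10, 1e15, 1024, 0.5))
    # A 4 x 8 float32 tensor: 32 elements at 4 bytes each.
    print(tensor_bytes(4 * 8, 4))
```

The estimate scales linearly in parameters and tokens and inversely in MFU, so halving utilization doubles the wall-clock time.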

Day 1 Lecture Notes

Intro & Tokenization

Runnable lecture with env. variables. 5 assignments, good for building research engineering muscles. Prototype locally, then benchmark on clusters. AI tool use: self-declared, only in chat mode, for prerequisites and errors. Researchers are disconnected from the underlying implementation. Abstraction improves productivity, but abstractions are leaky. Scope for fundamental research W…

Working through Stanford CS336 — Language Modeling from Scratch — in 33.6 days.
