Day 1Lecture Notes
Intro & Tokenization
Today I learned about Byte Pair Encoding (BPE), the algorithm used by most modern language models for tokenization. Unicode handling: Breaking text into bytes first, then merging Vocabulary size trade-offs: More tokens = shorter sequences but larger embedding matrix Special tokens: [BOS], [EOS], [PAD], [UNK] The basic algorithm: 1. Start with character-level…
Tokenization·3h
