Chapter 1: Transformers, of course

Relying mainly on random online resources on transformers (here and here), I started taking notes on how they work. I will not cover the architecture as a whole; instead I focus on the attention mechanism and the MLP, and on how these components act on the residual stream of a transformer. For now I therefore leave out topics such as embeddings, among others. I am also trying to implement this using Einstein notation (here a Python tutorial) for the various matrix operations, which I knew existed but had never used in practice. Below are handwritten notes to that effect.
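To make the Einstein notation idea concrete, here is a minimal sketch of a single attention head written with `np.einsum`. It is not taken from the notes: the function name `attention_head`, the weight names `W_Q`, `W_K`, `W_V`, `W_O`, the toy dimensions, and the causal (decoder-style) mask are all my own assumptions for illustration. The point is only to show how each matrix operation becomes one labelled index contraction, and how the head's output is added back onto the residual stream.

```python
import numpy as np

def attention_head(x, W_Q, W_K, W_V, W_O):
    """One causal attention head acting on the residual stream (illustrative sketch).

    x:             (seq, d_model) residual stream
    W_Q, W_K, W_V: (d_model, d_head)
    W_O:           (d_head, d_model)
    Index letters: s/q/k = position, d = d_model, h = d_head.
    """
    q = np.einsum("sd,dh->sh", x, W_Q)  # queries
    k = np.einsum("sd,dh->sh", x, W_K)  # keys
    v = np.einsum("sd,dh->sh", x, W_V)  # values

    # Dot product of every query with every key, scaled by sqrt(d_head).
    scores = np.einsum("qh,kh->qk", q, k) / np.sqrt(q.shape[-1])

    # Assumed causal mask: position q may not attend to positions k > q.
    seq = x.shape[0]
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax (subtracting the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    pattern = np.exp(scores)
    pattern = pattern / pattern.sum(axis=-1, keepdims=True)

    z = np.einsum("qk,kh->qh", pattern, v)  # weighted sum of values
    out = np.einsum("qh,hd->qd", z, W_O)    # project back to d_model

    # The head writes into the residual stream by addition.
    return x + out, pattern

# Toy example with made-up sizes: 4 tokens, d_model = 8, d_head = 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 2)) for _ in range(3))
W_O = rng.normal(size=(2, 8))
resid, pattern = attention_head(x, W_Q, W_K, W_V, W_O)
```

Reading the index strings is most of the exercise: `"qh,kh->qk"` says "sum over the head dimension `h`, keep one axis per query position and one per key position", which is exactly the query-key score matrix. A real transformer would run several such heads in parallel (one extra index letter in each einsum) and follow them with the MLP, which also adds its output to the residual stream.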