ArXiv Preprint - Birth of a Transformer: A Memory Viewpoint

In this episode we discuss “Birth of a Transformer: A Memory Viewpoint” by Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Hervé Jegou and Léon Bottou. The paper examines the internal workings of transformer-based large language models. The authors introduce a synthetic dataset to study how transformers balance global knowledge against context-specific knowledge. They find that two-layer transformers use an induction head mechanism to predict context-specific bigrams, and they propose a natural model of individual weight matrices as associative memories. Through their empirical study, the authors provide theoretical insights into how gradients enable these weight matrices to be learned during training, and they analyze the role of data-distributional properties.
