arxiv preprint - Better & Faster Large Language Models via Multi-token Prediction

In this episode, we discuss Better & Faster Large Language Models via Multi-token Prediction by Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve. The paper “Better & Faster Large Language Models via Multi-token Prediction” introduces a novel training methodology for large language models (LLMs) by predicting multiple future tokens simultaneously rather than the traditional single next-token prediction. This technique utilizes multiple independent output heads on a shared model trunk to predict several tokens at once, enhancing sample efficiency and model performance on generative tasks without increasing training times. The models trained using this method not only show improved results in tasks like coding but also benefit from faster inference times, up to three times quicker than traditional models.

arxiv preprint – Better & Faster Large Language Models via Multi-token Prediction