arXiv preprint – Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

In this episode, we discuss Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens by Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, and Hannaneh Hajishirzi. The paper introduces the "∞-gram", an n-gram language model trained on 1.4 trillion tokens in which n is not fixed in advance but can be arbitrarily large. The authors develop a suffix array-powered engine called infini-gram that computes counts and probabilities for these unbounded n-grams on the fly, without pre-computing any count tables. They demonstrate the framework's utility by using ∞-gram estimates to improve the predictions of neural large language models and to expose irregularities in machine-generated text, and they release the engine as an open-source tool for further research.
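To make the core idea concrete, here is a minimal, self-contained Python sketch (not the authors' engine; the toy corpus and function names are invented for illustration). It builds a suffix array over a tiny token sequence, counts any query n-gram by binary search, and estimates the next-token probability conditioned on the longest suffix of the context that appears in the corpus, which is the spirit of the ∞-gram estimate described above.

```python
# Toy illustration of suffix-array-based, unbounded n-gram probabilities.
# Requires Python 3.10+ (for the `key` argument to bisect). All names here
# are hypothetical; the real infini-gram engine operates on trillion-token
# corpora and is far more sophisticated.
from bisect import bisect_left, bisect_right

corpus = "the cat sat on the mat the cat sat on the rug".split()

# Suffix array: positions of all suffixes, sorted lexicographically.
suffix_array = sorted(range(len(corpus)), key=lambda i: corpus[i:])

def count(ngram):
    """Count occurrences of `ngram` (list of tokens) via binary search
    over the suffix array; no count table is ever materialized."""
    key = lambda i: corpus[i:i + len(ngram)]
    lo = bisect_left(suffix_array, ngram, key=key)
    hi = bisect_right(suffix_array, ngram, key=key)
    return hi - lo

def infini_gram_prob(context, token):
    """P(token | longest suffix of `context` that occurs in the corpus)."""
    for start in range(len(context) + 1):  # try the longest suffix first
        suffix = context[start:]
        denom = count(suffix)
        if denom > 0:
            return count(suffix + [token]) / denom
    return 0.0

print(infini_gram_prob("the cat sat on the".split(), "mat"))  # 0.5 here
```

In this toy corpus the context "the cat sat on the" occurs twice, once followed by "mat" and once by "rug", so the estimate is 0.5; the same two-count computation scales to arbitrarily long contexts because only the matching suffix range, not a precomputed table, is consulted.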

