arXiv preprint – LongNet: Scaling Transformers to 1,000,000,000 Tokens


In this episode we discuss LongNet: Scaling Transformers to 1,000,000,000 Tokens by Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, and Furu Wei. The paper introduces LONGNET, a Transformer variant that addresses the challenge of scaling sequence length in large language models. LONGNET uses dilated attention, which expands the attentive field exponentially as the distance between tokens grows. This yields linear computational complexity, a logarithmic dependency between any two tokens, and the ability to serve as a distributed trainer for extremely long sequences. Experiments show that LONGNET performs well on both long-sequence modeling and general language tasks, opening the door to modeling very long sequences such as an entire corpus or even the whole Internet.
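
To make the dilated-attention idea concrete, here is a minimal NumPy sketch, not the authors' implementation: it splits the sequence into segments of length w, keeps every r-th position within each segment, runs standard softmax attention over those positions, and scatters the outputs back. The function names, the (w, r) values, and the simple averaging of the two patterns at the end are illustrative assumptions.

```python
# Minimal sketch of one dilated-attention pattern: segments of length w,
# every r-th token inside each segment attends to the other kept tokens.
# Names and values here are illustrative, not from the official LongNet code.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(q, k, v, w, r):
    """Attention within segments of length w, keeping every r-th token.

    q, k, v: arrays of shape (seq_len, d). For simplicity, seq_len is
    assumed to be a multiple of w, and w a multiple of r.
    """
    seq_len, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, seq_len, w):
        idx = np.arange(start, start + w, r)        # sparsified positions
        qs, ks, vs = q[idx], k[idx], v[idx]
        scores = softmax(qs @ ks.T / np.sqrt(d))    # attention inside the segment
        out[idx] = scores @ vs                      # scatter results back
    return out

# LongNet mixes several (w, r) pairs with geometrically growing values,
# so nearby tokens get dense attention and distant tokens sparse attention.
q = k = v = np.random.randn(16, 8)
short_range = dilated_attention(q, k, v, w=4, r=1)   # dense, local
long_range  = dilated_attention(q, k, v, w=16, r=4)  # sparse, global
mixed = 0.5 * short_range + 0.5 * long_range         # naive mixing, for illustration only
```

Each segment contributes only (w/r)^2 attention scores, and because w and r grow geometrically across patterns, the total cost stays linear in sequence length, which is the property the paper highlights. The paper combines the patterns with weights derived from their attention denominators rather than the naive averaging used above.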

