arXiv Preprint – HyperAttention: Long-context Attention in Near-Linear Time


In this episode we discuss HyperAttention: Long-context Attention in Near-Linear Time by Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, and Amir Zandieh. The paper introduces "HyperAttention," an approximate attention mechanism for handling long contexts in Large Language Models (LLMs). It proposes two fine-grained parameters that measure the hardness of the attention problem and presents a near-linear time sampling algorithm for computing attention. Empirical results show that HyperAttention outperforms existing approximate methods, significantly speeding up inference while maintaining comparable perplexity. The paper concludes by highlighting the scalability limitations of exact computation in attention layers and the potential of HyperAttention to overcome them.
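To make the general idea of sampling-based attention approximation concrete, here is a minimal, hypothetical NumPy sketch: it estimates the softmax attention output for each query from a uniform subsample of the key/value pairs, reducing the per-query cost from O(n·d) to O(m·d). This is only an illustration of the sampling idea discussed in the episode, not the paper's specific HyperAttention procedure; the function and variable names are our own.

```python
# Illustrative sketch of sampling-based approximate attention.
# NOT the paper's HyperAttention algorithm; names here are hypothetical.
import numpy as np

def sampled_attention(Q, K, V, m, rng=None):
    """Estimate softmax(Q K^T / sqrt(d)) V using only m uniformly sampled
    key/value pairs per query block, instead of all n of them."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = K.shape
    idx = rng.choice(n, size=m, replace=False)      # sample m key/value rows
    scores = Q @ K[idx].T / np.sqrt(d)              # (n_q, m) instead of (n_q, n)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    w = np.exp(scores)
    # Softmax over the sampled columns serves as the estimate; the n/m
    # reweighting factor cancels in the normalized ratio.
    return (w @ V[idx]) / w.sum(axis=1, keepdims=True)

# Usage: approximate attention on random data with m << n sampled columns.
rng = np.random.default_rng(0)
n, d = 2048, 64
Q, K, V = [rng.standard_normal((n, d)) for _ in range(3)]
approx = sampled_attention(Q, K, V, m=256, rng=rng)
```

Uniform sampling like this works well only when no single key dominates a query's attention weights; handling the heavy entries efficiently is precisely the harder part that a practical long-context method must address.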

