arXiv preprint – CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models

In this episode, we discuss CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models by Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, and Azalia Mirhoseini. The paper presents “Contextually-Aware Thresholding for Sparsity” (CATS), a method that reduces the inference cost of Large Language Models (LLMs) by increasing activation sparsity while preserving performance. Unlike many prior sparsity-enhancing approaches, which degrade model quality, CATS applies a novel non-linear activation function that achieves up to 50% activation sparsity with minimal loss in downstream performance. CATS also converges faster and performs better on downstream tasks when fine-tuned, and its implementation via a custom GPU kernel yields about a 15% reduction in inference time on models such as Llama-7B and Mistral-7B.
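
To make the thresholding idea concrete, below is a minimal PyTorch-style sketch of what context-aware activation sparsification could look like based on the summary: gate activations whose magnitude falls below a cutoff are zeroed, with the cutoff calibrated on sample activations to hit a target sparsity level (e.g., 50%). The function names (`calibrate_threshold`, `cats_style_activation`), the SiLU gate, and the tensor shapes are illustrative assumptions, not the authors' implementation or their custom kernel.

```python
import torch
import torch.nn.functional as F


def calibrate_threshold(sample_gate_pre: torch.Tensor, target_sparsity: float = 0.5) -> torch.Tensor:
    """Hypothetical calibration step: choose a cutoff so that roughly
    `target_sparsity` of |SiLU(x)| values on the sample fall below it."""
    magnitudes = F.silu(sample_gate_pre).abs().flatten()
    return torch.quantile(magnitudes, target_sparsity)


def cats_style_activation(gate_pre: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    """Apply SiLU, then zero out activations whose magnitude is below the
    cutoff, producing a sparse gate output."""
    act = F.silu(gate_pre)
    return torch.where(act.abs() >= threshold, act, torch.zeros_like(act))


# Illustrative usage on random data (shapes and values are placeholders).
x = torch.randn(4, 16, 11008)                    # (batch, seq, MLP hidden dim)
t = calibrate_threshold(x, target_sparsity=0.5)  # cutoff aiming for ~50% sparsity
sparse_gate = cats_style_activation(x, t)
print(f"sparsity: {(sparse_gate == 0).float().mean().item():.2f}")  # ~0.50
```

The reported speedups come from a custom GPU kernel that exploits this sparsity during inference; the sketch above only illustrates the activation-level effect, not that kernel.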

