ArXiv Preprint – S-LoRA: Serving Thousands of Concurrent LoRA Adapters


In this episode we discuss S-LoRA: Serving Thousands of Concurrent LoRA Adapters
by Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, and Ion Stoica. The paper introduces S-LoRA, a system for efficiently serving a large number of Low-Rank Adaptation (LoRA) adapters for a language model: all adapters are stored in main memory, and the ones needed by currently running queries are fetched into GPU memory. S-LoRA uses Unified Paging to manage adapter weights and KV cache tensors in a single memory pool, together with custom CUDA kernels for heterogeneous batching and a tailored tensor parallelism strategy. Compared with state-of-the-art libraries, it achieves up to 4 times higher throughput and can serve thousands of adapters on a single GPU or across multiple GPUs. The system enables scalable, customized fine-tuning services, and the authors have made their code publicly available.
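
To make the Unified Paging idea more concrete, below is a minimal, hypothetical Python sketch of a shared paged memory pool in which both KV-cache blocks and LoRA adapter weights draw from the same set of fixed-size pages. The names here (`UnifiedPagePool`, `alloc`, `free`) are illustrative assumptions for this sketch, not S-LoRA's actual API or implementation.

```python
# Toy illustration of "unified paging": one pool of fixed-size pages shared
# between KV-cache blocks and LoRA adapter weights. Hypothetical sketch only;
# S-LoRA's real memory manager and kernels are more involved.
import torch


class UnifiedPagePool:
    def __init__(self, num_pages: int, page_size: int, device: str = "cpu"):
        # One flat buffer of equally sized pages; each page can hold either
        # KV-cache entries or a slice of an adapter's weights.
        self.pages = torch.empty(num_pages, page_size, device=device)
        self.free_pages = list(range(num_pages))
        self.owner = {}  # page index -> ("kv", seq_id) or ("lora", adapter_id)

    def alloc(self, owner_tag):
        # Hand out any free page; uniform page size avoids fragmentation
        # between the two kinds of tensors sharing the pool.
        if not self.free_pages:
            raise MemoryError("pool exhausted; evict an inactive adapter or sequence")
        idx = self.free_pages.pop()
        self.owner[idx] = owner_tag
        return idx, self.pages[idx]

    def free(self, idx: int):
        # Return a page to the shared pool, e.g. when a sequence finishes
        # or an adapter is swapped back to main memory.
        del self.owner[idx]
        self.free_pages.append(idx)


# Example: KV-cache blocks and adapter weights draw from the same pool.
pool = UnifiedPagePool(num_pages=1024, page_size=4096)
kv_idx, kv_page = pool.alloc(("kv", "seq-0"))
lora_idx, lora_page = pool.alloc(("lora", "adapter-42"))
pool.free(kv_idx)
```

The point of sharing one pool, as described in the paper, is that space can shift dynamically between KV caches and adapter weights as the mix of active requests and adapters changes, rather than reserving separate regions that fragment under load.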

