arXiv paper – Slow-Fast Architecture for Video Multi-Modal Large Language Models


In this episode, we discuss Slow-Fast Architecture for Video Multi-Modal Large Language Models by Min Shi, Shihao Wang, Chieh-Yun Chen, Jitesh Jain, Kai Wang, Junjun Xiong, Guilin Liu, Zhiding Yu, and Humphrey Shi. The paper presents a slow-fast architecture for video-based multi-modal large language models that uses two kinds of visual tokens to balance temporal resolution against spatial detail. “Fast” tokens give the model a compressed overview of the video, while “slow” tokens supply detailed, instruction-aware visual information, so the model can handle more frames with minimal extra computation. The authors report that the approach significantly outperforms existing methods, increases the number of frames the model can take as input, and achieves state-of-the-art performance among models of similar size.

