arXiv preprint - LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

In this episode, we discuss LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models by Yanwei Li, Chengyao Wang, and Jiaya Jia. The paper introduces LLaMA-VID, an approach that lets Vision Language Models (VLMs) process long videos by representing each frame with just two tokens: a context token and a content token. The context token captures the overall image context, while the content token targets specific visual details in each frame. Cutting the per-frame token count this way reduces the computational strain of handling extended video content. According to the paper, LLaMA-VID enhances VLM capabilities for long-duration video understanding and outperforms existing methods on various video and image benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
