arXiv preprint - Learning Video Representations from Large Language Models

In this episode, we discuss Learning Video Representations from Large Language Models by Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. The paper introduces LaViLa, a method that improves video-language representations by using pre-trained Large Language Models (LLMs) to generate video narrations automatically. These auto-generated narrations offer more detailed coverage of the video, better alignment between video and text, and greater textual diversity, which in turn yields a stronger video-text embedding. The approach outperforms prior results on existing benchmarks by a significant margin in both zero-shot and fine-tuned settings, with notable gains on video classification and retrieval tasks, even when trained on less data than the baselines.
