arXiv preprint – A Simple LLM Framework for Long-Range Video Question-Answering


In this episode, we discuss A Simple LLM Framework for Long-Range Video Question-Answering by Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. The LLoVi framework tackles long-range video question-answering (LVQA) by pairing a visual captioner with a Large Language Model (LLM) such as GPT-3.5 or GPT-4, forgoing specialized long-range video modeling architectures. Short clips extracted from a long video are first captioned, and an LLM then aggregates these captions to answer questions spanning the entire video, which proves more effective for LVQA than prior methods. In benchmarks, LLoVi notably outperformed previous best-performing approaches on several datasets, including EgoSchema, NExT-QA, IntentQA, and NExT-GQA, and the code for LLoVi will be made publicly available.
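To make the two-stage, caption-then-reason design concrete, here is a minimal Python sketch of such a pipeline. The `caption_clip` and `query_llm` helpers are hypothetical placeholders standing in for a clip-level visual captioner and an LLM API call; this is an illustration of the general idea under those assumptions, not the paper's actual implementation.

```python
from typing import Callable, List


def llovi_answer(
    clips: List[str],
    question: str,
    caption_clip: Callable[[str], str],
    query_llm: Callable[[str], str],
) -> str:
    """Hypothetical caption-then-reason pipeline in the spirit of LLoVi."""
    # Stage 1: a visual captioner turns each short clip into a text description.
    captions = [caption_clip(clip) for clip in clips]

    # Stage 2: an LLM reads the concatenated captions and answers the question
    # over the full video, with no dedicated long-range video architecture.
    prompt = (
        "Below are captions of consecutive short clips from one long video.\n"
        + "\n".join(f"[clip {i}] {c}" for i, c in enumerate(captions))
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return query_llm(prompt)


if __name__ == "__main__":
    # Stub helpers so the sketch runs end to end; swap in a real captioning
    # model and a real LLM API call in practice.
    demo_clips = ["clip_000.mp4", "clip_001.mp4", "clip_002.mp4"]
    fake_caption = lambda clip: f"something happens in {clip}"
    fake_llm = lambda prompt: "a placeholder answer"
    print(llovi_answer(demo_clips, "What is the person trying to do?",
                       fake_caption, fake_llm))
```

The appeal of this design is that all long-range temporal reasoning is offloaded to the LLM's text-level aggregation, so the visual side only ever needs to process short clips.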

