arxiv preprint - VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

In this episode, we discuss VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos by Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal. The paper introduces VideoTree, a novel framework that enhances the efficiency and accuracy of long-video question answering by selectively extracting and hierarchically organizing frames based on their relevance to the query. Unlike traditional methods that rely on dense and often redundant sampling of frames for LLM-based reasoning, VideoTree employs a dynamic, adaptive approach to identify and caption keyframes, forming a tree structure that reflects varying levels of detail where needed. Experiments demonstrate significant performance improvements and reduced inference times on benchmarks like EgoSchema, NExT-QA, and IntentQA.

arxiv preprint – VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos