arxiv preprint – E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

In this episode, we discuss E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding by Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen. The paper introduces E.T. Bench, a comprehensive benchmark for fine-grained event-level video understanding, evaluating Video-LLMs across 12 tasks and 7K videos. It highlights the challenges these models face in accurately understanding and grounding events within videos. To improve performance, E.T. Chat and an instruction-tuning dataset, E.T. Instruct 164K, are proposed, enhancing models’ abilities and underlining the necessity for advanced datasets and models in temporal and multi-event video-language tasks.


Posted

in

by

Tags: