arXiv Preprint - EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

In this episode we discuss "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding" by Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. The paper presents EgoSchema, a benchmark dataset and evaluation metric for assessing the long-form video language understanding capabilities of vision and language systems. The dataset consists of over 5000 multiple-choice question-answer pairs drawn from 250 hours of real video data. Each question asks the system to select the correct answer from five options based on a three-minute video clip. The authors argue that existing video understanding datasets lack long temporal structure, and they show that state-of-the-art video and language models have limitations in long-term video understanding.
