In this episode, we discuss NEPTUNE: THE LONG ORBIT TO BENCHMARKING LONG VIDEO UNDERSTANDING by Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, and Tobias Weyand. The paper introduces “Neptune,” a semi-automatic system that generates complex question-answer-decoy sets from long video content, addressing the fact that video comprehension tasks have typically been limited to short clips. Leveraging large models such as Vision-Language Models (VLMs) and Large Language Models (LLMs), Neptune creates detailed, time-aligned captions and intricate QA sets for videos up to 15 minutes long, with the aim of improving annotation efficiency. The resulting dataset emphasizes multimodal reasoning, and the paper introduces the GEM metric for evaluating responses. Evaluations on the dataset reveal that current long video models are weak at understanding temporal and state changes.
arXiv preprint – NEPTUNE: THE LONG ORBIT TO BENCHMARKING LONG VIDEO UNDERSTANDING