ICCV 2023 – Verbs in Action: Improving verb understanding in video-language models


In this episode we discuss "Verbs in Action: Improving verb understanding in video-language models" by Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, and Cordelia Schmid. The paper proposes a Verb-Focused Contrastive (VFC) framework to address the limited understanding of verbs in video-language models. The framework uses pre-trained large language models (LLMs) to generate hard negative captions that change only the verb while keeping the rest of the context intact. The method achieves state-of-the-art zero-shot results on three downstream tasks: video-text matching, video question-answering, and video classification.
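To make the idea of verb-swapped hard negatives concrete, here is a minimal sketch of a contrastive loss in which a video is scored against its true caption and against captions whose verb has been changed. It is an illustration only and not the paper's actual loss: the function names, the temperature of 0.07, and the two-dimensional toy embeddings are all assumptions, and the LLM step that would generate the negative captions is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss_with_hard_negatives(video, positive, hard_negatives,
                                         temperature=0.07):
    """InfoNCE-style loss for one video (hypothetical helper, not from the paper).

    video:          embedding of the video clip
    positive:       embedding of the true caption, e.g. "a man opens a door"
    hard_negatives: embeddings of verb-swapped captions,
                    e.g. "a man closes a door"
    """
    # The true caption goes first, followed by the verb-swapped captions.
    logits = [cosine(video, positive) / temperature]
    logits += [cosine(video, neg) / temperature for neg in hard_negatives]

    # Numerically stable log-sum-exp over all candidate captions.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )

    # Negative log-probability assigned to the true caption.
    return log_denominator - logits[0]


# Toy embeddings (assumed values, chosen only for illustration).
video = [1.0, 0.0]
true_caption = [0.9, 0.1]     # close to the video
verb_swapped = [0.1, 0.9]     # same context, different verb

aligned = contrastive_loss_with_hard_negatives(video, true_caption, [verb_swapped])
confused = contrastive_loss_with_hard_negatives(video, verb_swapped, [true_caption])
print(aligned < confused)
```

The loss is small when the video embedding sits closer to the true caption than to the verb-swapped one, and large when the two are confused, which is the pressure that pushes a model to attend to the verb rather than only to the surrounding nouns.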

