NeurIPS 2022 – Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners


In this episode we discuss Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
by Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, and Heng Ji. VidIL is a few-shot video-language learner that combines image and language models to generalize across video-to-text tasks with only a handful of examples. It translates video content into frame captions and object, attribute, and event phrases, then composes them into a temporal-aware template. The language model is then prompted with a few in-context examples to generate the target output. Experimental results show that VidIL outperforms supervised models on video future event prediction.
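The pipeline described above (per-frame captions and visual phrases composed into a temporally ordered prompt, preceded by a few in-context examples) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the temporal marker words, field names, and prompt layout are assumptions for demonstration.

```python
# Hypothetical sketch of a VidIL-style prompt builder.
# Temporal markers, dict keys, and formatting are illustrative assumptions.

ORDER_WORDS = ["First", "Then", "Then", "Finally"]  # per-frame temporal markers

def frame_block(frames):
    """Render frames as temporally ordered caption + phrase lines."""
    lines = []
    for i, f in enumerate(frames):
        marker = ORDER_WORDS[min(i, len(ORDER_WORDS) - 1)]
        lines.append(
            f"{marker}, {f['caption']} "
            f"(objects: {', '.join(f['objects'])}; "
            f"events: {', '.join(f['events'])})"
        )
    return "\n".join(lines)

def build_prompt(examples, query_frames, instruction):
    """Prepend few-shot (frames, target) examples, then the query video."""
    parts = [instruction]
    for ex_frames, target in examples:
        parts.append(frame_block(ex_frames) + "\nSummary: " + target)
    parts.append(frame_block(query_frames) + "\nSummary:")
    return "\n\n".join(parts)
```

The resulting string would be passed to a language model (e.g. GPT-3 in the paper) for few-shot generation; swapping the final target field lets the same template serve captioning, question answering, or future event prediction.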