In this episode we discuss Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
by Filip Radenovic, Abhimanyu Dubey, Abhishek Kadian, Todor Mihaylov, Simon Vandenhende, Yash Patel, Yi Wen, Vignesh Ramanathan, Dhruv Mahajan. The paper discusses improvements to the contrastive pre-training pipeline for vision-language models used in zero-shot recognition problems. The authors propose a filtering strategy called CAT that reduces dataset size, an approach called Concept Distillation that leverages strong unimodal representations, and a modification of the traditional contrastive alignment objective that uses importance sampling to up-weight hard negatives without adding training complexity. Their Distilled and Hard-negative Training (DiHT) approach improves performance on 20 of 29 tasks in a zero-shot benchmark and bridges the gap between zero-shot and few-shot performance in linear probing. Demo code is available on GitHub.
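To make the hard-negative idea concrete, here is a minimal PyTorch-style sketch of an importance-sampling-weighted contrastive loss. The function name, hyperparameters, and exact weighting scheme are illustrative assumptions, not the authors' implementation; the point is only that more-similar (harder) negatives receive larger weight inside an otherwise standard symmetric InfoNCE objective.

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(image_emb, text_emb, temperature=0.07, beta=0.5):
    """Sketch of an InfoNCE-style loss whose negatives are re-weighted by an
    importance-sampling factor so that harder (more similar) negatives count more.
    `beta` controls how sharply hard negatives are up-weighted; beta = 0 roughly
    recovers the standard symmetric contrastive loss. Names are hypothetical."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (N, N) similarity matrix
    n = logits.size(0)
    diag = torch.eye(n, dtype=torch.bool, device=logits.device)

    def weighted_nce(l):
        pos = l.diagonal()                                # positive-pair logits
        # importance weights: up-sample negatives proportional to exp(beta * sim)
        w = torch.exp(beta * l).masked_fill(diag, 0.0)
        w = w / w.sum(dim=1, keepdim=True) * (n - 1)      # keep average weight ~1
        neg = (w * torch.exp(l)).sum(dim=1)
        return -(pos - torch.log(torch.exp(pos) + neg)).mean()

    # symmetric over image-to-text and text-to-image directions
    return 0.5 * (weighted_nce(logits) + weighted_nce(logits.t()))
```

Because the re-weighting only rescales terms already computed for the standard contrastive loss, it adds no extra forward passes or negative mining, which is the sense in which the approach avoids added complexity.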
CVPR 2023 – Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training