arXiv preprint – Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

In this episode, we discuss Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video by Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M. Asano, and Yannis Avrithis. The paper presents two innovations in self-supervised learning: a new dataset called “Walking Tours,” consisting of high-resolution, long-duration, first-person videos well suited to self-supervision, and a novel pretraining method called DoRA, which uses transformer cross-attention to discover and track objects across the frames of a video and learn image representations from those tracks. Rather than adapting image-based pretraining to video, this approach centers on tracking objects over time. The researchers found that combining the Walking Tours dataset with DoRA performs comparably to ImageNet pretraining on a range of image and video recognition tasks, demonstrating the efficiency of their method.
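To make the cross-attention tracking idea concrete, here is a minimal sketch of one attention step in which a handful of object queries attend over the patch tokens of each video frame and pool per-object features. All names, shapes, and the use of generic learned queries are illustrative assumptions for this sketch; it is not the authors' DoRA implementation.

```python
# Illustrative sketch only: object queries cross-attend over per-frame patch tokens,
# yielding a soft "track" (attention map) and a pooled feature per object per frame.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_frames, num_patches, dim = 4, 196, 64   # e.g. a 14x14 patch grid per frame (assumed)
num_objects = 3                             # hypothetical number of object queries

# Patch tokens from a frame encoder (random stand-ins here) and object queries.
frame_tokens = torch.randn(num_frames, num_patches, dim)
object_queries = torch.randn(num_objects, dim)

# Simple query/key projections; values reuse the patch tokens directly.
W_q = torch.randn(dim, dim) / dim ** 0.5
W_k = torch.randn(dim, dim) / dim ** 0.5

tracked_features = []
for t in range(num_frames):
    q = object_queries @ W_q                        # (objects, dim)
    k = frame_tokens[t] @ W_k                       # (patches, dim)
    attn = F.softmax(q @ k.T / dim ** 0.5, dim=-1)  # (objects, patches): soft track per object
    obj_feat = attn @ frame_tokens[t]               # pooled feature for each tracked object
    tracked_features.append(obj_feat)

tracked_features = torch.stack(tracked_features)    # (frames, objects, dim)
print(tracked_features.shape)                       # torch.Size([4, 3, 64])
```

In a full pretraining setup, features pooled along such tracks would feed a self-supervised objective so the encoder learns object-centric representations from a single long video; the sketch above only shows the attention and pooling step.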

