CVPR 2023 – OmniMAE: Single Model Masked Pretraining on Images and Videos


In this episode we discuss OmniMAE: Single Model Masked Pretraining on Images and Videos.

Authors:
– Rohit Girdhar
– Alaaeldin El-Nouby
– Mannat Singh
– Kalyan Vasudev Alwala
– Armand Joulin
– Ishan Misra

Affiliation:
– FAIR, Meta AI

The paper shows how a common transformer architecture can be used to train a single unified model for multiple visual modalities, namely images and videos, via masked autoencoding. The resulting vision transformer learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, without requiring any labeled data. Because a large proportion of image and video patches can be dropped during pretraining, the model is also efficient to train. The pretrained model achieves new state-of-the-art performance on the ImageNet and Something Something-v2 video benchmarks.
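To make the masking idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: images are treated as short videos so that one tokenizer and one encoder serve both modalities, and roughly 90% of the patches are dropped before the encoder ever sees them. The `patchify` helper, the `TinyMAE` class, and all layer sizes are illustrative assumptions.

```python
# Minimal sketch of OmniMAE-style masked pretraining (illustrative, not the paper's code).
import torch
import torch.nn as nn


def patchify(x: torch.Tensor, t: int = 2, p: int = 16) -> torch.Tensor:
    """Flatten a video (B, C, T, H, W) into (B, N, t*p*p*C) patches.
    An image is handled as a short video, so both modalities share one
    tokenization (assumption: T divisible by t, H and W divisible by p).
    """
    B, C, T, H, W = x.shape
    x = x.reshape(B, C, T // t, t, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)       # (B, T/t, H/p, W/p, t, p, p, C)
    return x.reshape(B, -1, t * p * p * C)      # (B, N, patch_dim)


class TinyMAE(nn.Module):
    def __init__(self, patch_dim: int, dim: int = 128, mask_ratio: float = 0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, patch_dim)  # stand-in for a small decoder

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        B, N, D = patches.shape
        n_keep = max(1, int(N * (1 - self.mask_ratio)))
        # Random per-sample shuffle; keep only the first n_keep patch indices.
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep = idx[:, :n_keep]
        visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        # The encoder only processes ~10% of the tokens: the efficiency win.
        z = self.encoder(self.embed(visible))
        # Scatter encoded tokens back; masked slots get the learned mask token.
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, z.shape[-1]), z)
        return self.decoder(full)                # reconstruct every patch


# Usage: one model and one objective for both an image and a video batch.
video = torch.randn(2, 3, 4, 32, 32)                         # (B, C, T, H, W)
image = torch.randn(2, 3, 1, 32, 32).repeat(1, 1, 2, 1, 1)   # image as 2 frames
model = TinyMAE(patch_dim=2 * 16 * 16 * 3)
for batch in (video, image):
    target = patchify(batch)
    # For brevity the loss covers all patches; MAE-style training computes
    # the reconstruction loss on the masked patches only.
    loss = nn.functional.mse_loss(model(target), target)
    loss.backward()
```

The key design choice this sketch reflects is that nothing in the pipeline is modality-specific: an image becomes a two-frame clip, so the same patch embedding, encoder, and reconstruction loss apply unchanged to both inputs.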

