NeurIPS 2024 - Moving Off-the-Grid: Scene-Grounded Video Representations

In this episode, we discuss Moving Off-the-Grid: Scene-Grounded Video Representations by Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew A. Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, Thomas Kipf. The paper introduces Moving Off-the-Grid (MooG), a video representation model whose representation is not tied to a fixed spatial or spatio-temporal grid, addressing the difficulty grid-based models have with scenes that change over time. MooG uses cross-attention and positional embeddings to track scene elements and represent them consistently as they move, and it is trained with a self-supervised next-frame prediction objective. The authors report strong performance across several vision tasks, presenting the model as a robust alternative to conventional grid-based methods.
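
For listeners who want a concrete picture of the cross-attention idea mentioned above, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`cross_attention`, `update_tokens`), the single-head formulation, the residual update, and all sizes are assumptions chosen for illustration, and the inputs are random. It only shows how a small set of tokens that are not bound to grid cells could read from grid-shaped frame features whose keys carry positional embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: (num_tokens, dim); keys and values: (num_patches, dim).
    Returns the attended values and the attention weights.
    """
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values, weights


def update_tokens(tokens, frame_features, pos_emb):
    """Hypothetical update step: off-grid tokens read from one frame.

    Positional embeddings are added to the keys so each token can
    attend to *where* content is, not only to what it looks like.
    """
    keys = frame_features + pos_emb
    update, weights = cross_attention(tokens, keys, frame_features)
    return tokens + update, weights  # residual update (an assumption)


# Illustrative sizes only: 4 tokens reading from a 3x3 feature grid.
rng = np.random.default_rng(0)
num_tokens, grid_h, grid_w, dim = 4, 3, 3, 8
tokens = rng.normal(size=(num_tokens, dim))
frame_features = rng.normal(size=(grid_h * grid_w, dim))
pos_emb = rng.normal(size=(grid_h * grid_w, dim))

new_tokens, weights = update_tokens(tokens, frame_features, pos_emb)
print(new_tokens.shape, weights.shape)
```

Each row of `weights` is a distribution over the nine grid locations, so a token is free to follow content to whichever location it moves to rather than staying attached to one cell.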
