CVPR 2023 – MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering


In this episode we discuss MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering by Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, and Mike Zheng Shou. The paper introduces MIST, a model for long-form VideoQA that addresses challenges such as multi-event reasoning, interactions among visual concepts at different granularities, and causality reasoning. MIST decomposes spatial-temporal self-attention, handles visual concepts at multiple granularities, and performs iterative selection and attention across layers. Experiments show that MIST achieves state-of-the-art performance while remaining computationally efficient and interpretable.
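To make the "iterative selection and attention across layers" idea concrete, here is a minimal sketch of question-guided segment selection followed by attention, repeated over a few layers. This is not the authors' implementation: the tensor shapes, module names (IterativeSelectionLayer, IterativeSelector), the additive question-segment fusion, and the hard top-k selection are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the MIST codebase): score video segments
# against the question, keep the top-k, attend over them, and repeat so the
# selection is refined layer by layer.
import torch
import torch.nn as nn


class IterativeSelectionLayer(nn.Module):
    """One layer: score segments against the question, keep top-k, attend."""

    def __init__(self, dim: int, num_heads: int = 4, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.score = nn.Linear(dim, 1)  # relevance score per segment (assumed fusion)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, segments: torch.Tensor) -> torch.Tensor:
        # question: (B, D) pooled question feature; segments: (B, T, D)
        B, T, D = segments.shape
        # Score each segment conditioned on the question (simple additive fusion).
        scores = self.score(segments + question.unsqueeze(1)).squeeze(-1)  # (B, T)
        topk = scores.topk(min(self.top_k, T), dim=1).indices              # (B, k)
        selected = torch.gather(
            segments, 1, topk.unsqueeze(-1).expand(-1, -1, D)
        )                                                                  # (B, k, D)
        # The question attends over the selected segments only.
        q = question.unsqueeze(1)                                          # (B, 1, D)
        fused, _ = self.attn(q, selected, selected)
        return self.norm(question + fused.squeeze(1))                      # refined question


class IterativeSelector(nn.Module):
    """Stack layers so the selection is refined iteratively across layers."""

    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            IterativeSelectionLayer(dim) for _ in range(num_layers)
        )

    def forward(self, question: torch.Tensor, segments: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            question = layer(question, segments)
        return question


if __name__ == "__main__":
    B, T, D = 2, 16, 256  # batch, segments, feature dim (illustrative)
    model = IterativeSelector(D)
    out = model(torch.randn(B, D), torch.randn(B, T, D))
    print(out.shape)      # torch.Size([2, 256])
```

The key efficiency point the paper's summary hints at is visible here: each layer attends over only k selected segments rather than all T, which keeps attention cost low for long videos while the stacked layers let the selection be revisited as the question representation is refined.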

