CVPR 2023 – ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding


In this episode we discuss ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding
by Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. The paper introduces ULIP, a framework that learns a unified representation of images, text, and 3D point clouds to overcome the limited recognition ability of current 3D models, which stems from training datasets with few annotated samples and a pre-defined set of categories. ULIP pre-trains with object triplets spanning the three modalities; to overcome the shortage of such triplets, it leverages a pre-trained vision-language model and uses synthesized triplets to learn a 3D representation space aligned with the model's common image-text space. Results show that ULIP improves the performance of multiple recent 3D backbones, achieving state-of-the-art results in both standard and zero-shot 3D classification on several datasets.
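
At its core, ULIP trains a 3D point cloud encoder so that its features land in the frozen image-text embedding space of a pre-trained vision-language model (CLIP in the paper). The sketch below illustrates that alignment with a symmetric contrastive (InfoNCE) loss between the 3D features and the image and text features of the same object; the encoder interface, the batch keys, and the `ulip_step` helper are hypothetical placeholders under those assumptions, not the authors' actual code.

```python
# Minimal sketch of ULIP-style cross-modal alignment, assuming PyTorch and a
# trainable point cloud backbone. The image/text embeddings are assumed to be
# precomputed by a frozen vision-language model (e.g. CLIP); names such as
# `pc_encoder`, `batch["points"]`, and `ulip_step` are illustrative placeholders.
import torch
import torch.nn.functional as F

def contrastive_loss(feats_a, feats_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired features."""
    feats_a = F.normalize(feats_a, dim=-1)
    feats_b = F.normalize(feats_b, dim=-1)
    logits = feats_a @ feats_b.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(feats_a.size(0), device=feats_a.device)
    # Matching pairs lie on the diagonal; score each row and each column as a
    # classification problem over the batch.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def ulip_step(pc_encoder, batch, optimizer):
    """One training step: pull 3D features toward the frozen image/text
    features of the same object; the vision-language model is not updated."""
    pc_feats = pc_encoder(batch["points"])              # only the 3D encoder trains
    img_feats = batch["image_emb"]                      # frozen CLIP image features
    txt_feats = batch["text_emb"]                       # frozen CLIP text features
    loss = (contrastive_loss(pc_feats, img_feats)
            + contrastive_loss(pc_feats, txt_feats))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Once the spaces are aligned, zero-shot 3D classification reduces to embedding category-name prompts with the frozen text encoder and picking the class whose text feature is most similar to the point cloud feature.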

