arXiv preprint - OneLLM: One Framework to Align All Modalities with Language

In this episode we discuss OneLLM: One Framework to Align All Modalities with Language by Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue. The paper introduces OneLLM, a multimodal large language model that aligns eight different modalities to language within a single unified framework. It uses a new image projection module and a universal projection module for multimodal alignment, which lets the model progressively align more modalities. OneLLM is evaluated on 25 benchmarks covering a range of multimodal tasks, where it is reported to perform strongly, and it is accompanied by a specially curated multimodal instruction dataset of 2 million items. The resources are accessible online.
