Arxiv Preprint - ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

In this episode we discuss ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
by Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin. The ShareGPT4V dataset, with 1.2 million rich descriptive captions, has been created to enhance modality alignment in large multi-modal models (LMMs), offering greater diversity and information content across various domains. When integrated into the Supervised Fine-Tuning (SFT) phase, ShareGPT4V significantly improved performances of advanced models on benchmarks, showcasing its utility in enriching LMMs. Additionally, utilizing ShareGPT4V data in both pre-training and SFT processes led to the development of ShareGPT4V-7B, a streamlined and high-performing LMM, demonstrating the dataset’s potential to propel multi-modal research.

Arxiv Preprint – ShareGPT4V: Improving Large Multi-Modal Models with Better Captions