arxiv preprint - Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

In this episode, we discuss Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models by Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. The paper introduces Mini-Gemini, a framework that aims to narrow the performance gap between current Vision Language Models (VLMs) and advanced models like GPT-4. Mini-Gemini focuses on three enhancements: refining visual tokens with high-resolution image features without increasing the visual token count, building a high-quality dataset for more precise image understanding and reasoning, and enabling VLMs to handle image understanding, reasoning, and generation at the same time. The framework works with a range of large language models from 2B to 34B parameters, achieves leading performance on several zero-shot benchmarks, and is publicly available. Project page: https://mini-gemini.github.io/
