arXiv preprint – Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

In this episode, we discuss Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs by Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. The paper presents Ferret-UI, a multimodal large language model (MLLM) tailored for understanding and interacting with mobile user interface screens. To handle the fine detail on such screens, such as small icons and text, the model divides each screen into sub-images that are processed at higher effective resolution. It is trained on a range of UI-focused tasks whose samples are formatted for instruction following and carry region annotations, which strengthens abilities such as icon recognition and conversational interaction. According to the paper, Ferret-UI outperforms existing models on UI comprehension and task execution, and the authors establish a new benchmark for evaluating MLLMs on user interface understanding.



