In this episode, we discuss LITA: Language Instructed Temporal-Localization Assistant by De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. The paper introduces LITA, which addresses a weakness of multimodal Large Language Models (LLMs) that process video: they struggle with temporal localization, that is, with identifying “when” an event occurs in a video. LITA makes three changes: it adds time tokens that encode timestamps relative to the video length, it uses a SlowFast token architecture to capture finer temporal resolution, and it emphasizes temporal localization data during training, introducing a new task, Reasoning Temporal Localization (RTL), along with its dataset, ActivityNet-RTL. The authors report strong improvements on temporal localization tasks and on video-based text generation, and the code is publicly available on GitHub.
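To make the time-token idea concrete, here is a minimal Python sketch of how a timestamp could be turned into a token that is relative to the video length, and back again. This is an illustration under stated assumptions, not the authors' code: the function names, the `<n>` token format, the default of 100 tokens, and the exact rounding rule are all hypothetical choices made for this example.

```python
def timestamp_to_token(seconds: float, video_length: float, num_tokens: int = 100) -> str:
    """Map an absolute timestamp to a relative time token such as <21>.

    Assumption for this sketch: tokens are numbered 1..num_tokens and are
    spread evenly from the start of the video to its end.
    """
    if video_length <= 0:
        raise ValueError("video_length must be positive")
    if num_tokens < 2:
        raise ValueError("num_tokens must be at least 2")
    # Clamp so timestamps outside the video still map to a valid token.
    fraction = min(max(seconds / video_length, 0.0), 1.0)
    index = round(fraction * (num_tokens - 1)) + 1
    return f"<{index}>"


def token_to_timestamp(token: str, video_length: float, num_tokens: int = 100) -> float:
    """Convert a relative time token back to an approximate timestamp in seconds."""
    if num_tokens < 2:
        raise ValueError("num_tokens must be at least 2")
    index = int(token.strip("<>"))
    if not 1 <= index <= num_tokens:
        raise ValueError("token index out of range")
    return video_length * (index - 1) / (num_tokens - 1)
```

Because the token depends only on the fraction of the video that has elapsed, the same small vocabulary covers clips of any duration; the cost is that precision shrinks as the video gets longer, since each token then spans more seconds.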
arXiv preprint – LITA: Language Instructed Temporal-Localization Assistant