Category: Uncategorized

arxiv preprint – Graph-enhanced Large Language Models in Asynchronous Plan Reasoning

In this episode, we discuss Graph-enhanced Large Language Models in Asynchronous Plan Reasoning by Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony Cohn, Janet B. Pierrehumbert. The paper investigates how well large language models (LLMs) like GPT-4 and LLaMA-2 handle reasoning about asynchronous plans and finds that they perform poorly without visual…

July 31, 2024
arxiv preprint – LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

In this episode, we discuss LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference by Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi. The paper introduces LazyLLM, a method that selectively computes only the essential token’s Key-Value (KV) cache for next token prediction during the prefilling and decoding stages of…

July 30, 2024
arxiv preprint – OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

In this episode, we discuss OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person by Ke Sun, Jian Cao, Qi Wang, Linrui Tian, Xindi Zhang, Lian Zhuo, Bang Zhang, Liefeng Bo, Wenbo Zhou, Weiming Zhang, Daiheng Gao. Virtual Try-On (VTON) technology faces challenges in generating high-fidelity and consistent images. While existing diffusion models…

July 29, 2024
arxiv preprint – DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

In this episode, we discuss DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM by Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Philip Torr, Jian Wu. DetToolChain introduces a prompting toolkit and a Chain-of-Thought methodology to enhance zero-shot object detection capabilities in multimodal large language models like GPT-4V…

July 26, 2024
arxiv preprint – Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning

In this episode, we discuss Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning by Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, Alexandre Ramé, Johan Ferret, Geoffrey Cideron, Le Hou, Hongkun Yu, Amr Ahmed, Aranyak Mehta, Léonard Hussenot, Olivier Bachem, Edouard…

July 23, 2024
arxiv preprint – Chameleon: Mixed-Modal Early-Fusion Foundation Models

In this episode, we discuss Chameleon: Mixed-Modal Early-Fusion Foundation Models by Chameleon Team. The paper introduces Chameleon, a family of models designed for seamless understanding and generating both images and text in any sequence. It achieves state-of-the-art performance in several tasks, including image captioning and text generation, and demonstrates competence in mixed-modal outputs. Notably, Chameleon…

July 22, 2024
arxiv preprint – Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

In this episode, we discuss Goldfish: Vision-Language Understanding of Arbitrarily Long Videos by Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, Mohamed Elhoseiny. The paper introduces Goldfish, a methodology designed to efficiently comprehend videos of any length by employing a retrieval mechanism that selects top-k relevant video…

July 18, 2024
arxiv preprint – Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

In this episode, we discuss Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity by Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà. The paper introduces MaskVAT, a video-to-audio generative model that utilizes a masked generative model alongside a high-quality general audio codec to achieve superior audio quality, semantic matching, and temporal synchronization. MaskVAT effectively addresses the…

July 17, 2024
arxiv preprint – Human-like Episodic Memory for Infinite Context LLMs

In this episode, we discuss Human-like Episodic Memory for Infinite Context LLMs by Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee, Fenia Christopoulou, Gerasimos Lampouras, Haitham Bou-Ammar, Jun Wang. The paper introduces EM-LLM, an approach that enhances large language models (LLMs) by incorporating principles of human episodic memory and event cognition, enabling them to manage extensive…

July 15, 2024
arxiv preprint – Learning to (Learn at Test Time): RNNs with Expressive Hidden States

In this episode, we discuss Learning to (Learn at Test Time): RNNs with Expressive Hidden States by Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin. The paper introduces Test-Time Training (TTT) layers, a new type of sequence modeling layer…

July 12, 2024
arxiv preprint – Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

In this episode, we discuss Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions by Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi. The paper introduces a new annotation strategy termed graph-based captioning (GBC) that uses labelled graph structures to…

July 11, 2024
arxiv preprint – Evaluating Human Alignment and Model Faithfulness of LLM Rationale

In this episode, we discuss Evaluating Human Alignment and Model Faithfulness of LLM Rationale by Mohsen Fayyaz, Fan Yin, Jiao Sun, Nanyun Peng. The paper investigates how effectively large language models (LLMs) can explain their decisions through rationales extracted from input texts. It compares two types of rationale extraction methods—attribution-based and prompting-based—finding that prompting-based rationales…

July 9, 2024
arxiv preprint – Detection and Measurement of Syntactic Templates in Generated Text

In this episode, we discuss Detection and Measurement of Syntactic Templates in Generated Text by Chantal Shaib, Yanai Elazar, Junyi Jessy Li, Byron C. Wallace. The paper investigates syntactic features in text generated by large language models (LLMs), revealing higher rates of templated text in these models compared to human-generated text. It finds that a…

July 8, 2024
arxiv preprint – From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

In this episode, we discuss From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data by Zheyang Xiong, Vasilis Papageorgiou, Kangwook Lee, Dimitris Papailiopoulos. This paper addresses the challenge Large Language Models (LLMs) face with long-context information retrieval and reasoning. The authors propose finetuning LLMs using a synthetic dataset…

July 1, 2024
arxiv preprint – MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

In this episode, we discuss MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning by Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang. The study presents MG-LLaVA, a multi-modal large language model designed to process both low-resolution and high-resolution images along with object-centric features for improved perception tasks. It includes a high-resolution…

June 27, 2024
arxiv preprint – 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

In this episode, we discuss 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities by Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir. The paper presents a novel any-to-any model that significantly extends the capabilities of existing multimodal and multitask foundation models…

June 26, 2024
arxiv preprint – VideoLLM-online: Online Video Large Language Model for Streaming Video

In this episode, we discuss VideoLLM-online: Online Video Large Language Model for Streaming Video by Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou. The paper discusses the development of the Learning-In-Video-Stream (LIVE) framework, which improves large multimodal models’ ability to handle…

June 25, 2024
arxiv preprint – EvTexture: Event-driven Texture Enhancement for Video Super-Resolution

In this episode, we discuss EvTexture: Event-driven Texture Enhancement for Video Super-Resolution by Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun. The paper introduces EvTexture, the first video super-resolution (VSR) method using event signals specifically for enhancing texture details. The proposed method employs a new texture enhancement branch and an iterative module to progressively refine…

June 24, 2024
arxiv preprint – MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

In this episode, we discuss MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model by Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng. MOFA-Video is a novel image animation technique that produces videos from a single image using various control signals like human landmarks, manual trajectories,…

June 21, 2024
arxiv preprint – An Image is Worth More Than 16×16 Patches: Exploring Transformers on Individual Pixels

In this episode, we discuss An Image is Worth More Than 16×16 Patches: Exploring Transformers on Individual Pixels by Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain, Martin R. Oswald, Cees G. M. Snoek, Xinlei Chen. This paper questions the necessity of locality inductive bias in modern computer vision architectures by showing that vanilla Transformers can treat…

June 20, 2024
arxiv preprint – Graphic Design with Large Multimodal Model

In this episode, we discuss Graphic Design with Large Multimodal Model by Yutao Cheng, Zhao Zhang, Maoke Yang, Hui Nie, Chunyuan Li, Xinglong Wu, Jie Shao. The paper introduces Hierarchical Layout Generation (HLG) for graphic design, which creates compositions from unordered sets of design elements, addressing limitations of the existing Graphic Layout Generation (GLG). The…

June 19, 2024
arxiv preprint – LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

In this episode, we discuss LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning by Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig. The paper introduces LLARVA, a model improved with a novel instruction-tuning method to unify various robotic tasks using structured prompts. The model utilizes 2-D visual traces…

June 18, 2024
arxiv preprint – Transformers need glasses! Information over-squashing in language tasks

In this episode, we discuss Transformers need glasses! Information over-squashing in language tasks by Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G. M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković. The paper explores how information propagates in decoder-only Transformers, revealing a phenomenon where different input sequences can result in nearly identical final token…

June 17, 2024
arxiv preprint – Show, Don’t Tell: Aligning Language Models with Demonstrated Feedback

In this episode, we discuss Show, Don’t Tell: Aligning Language Models with Demonstrated Feedback by Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang. The paper introduces Demonstration ITerated Task Optimization (DITTO), a method for customizing language model outputs using fewer than ten demonstrations as feedback. DITTO, based on online imitation learning,…

June 14, 2024
arxiv preprint – TextGrad: Automatic ”Differentiation” via Text

In this episode, we discuss TextGrad: Automatic “Differentiation” via Text by Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou. The paper introduces TEXTGRAD, a novel framework that automates the optimization of compound AI systems by utilizing textual feedback from large language models (LLMs). TEXTGRAD treats text feedback as a…

June 13, 2024