Category: Uncategorized

  • arxiv preprint – SHIC: Shape-Image Correspondences with no Keypoint Supervision

    In this episode, we discuss SHIC: Shape-Image Correspondences with no Keypoint Supervision by Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi. The paper introduces SHIC, a novel method for learning canonical surface mappings without manual supervision by using foundation models such as DINO and Stable Diffusion. SHIC simplifies the task to image-to-image correspondence prediction, outperforming some supervised…

  • arxiv preprint – E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

    In this episode, we discuss E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding by Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen. The paper introduces E.T. Bench, a comprehensive benchmark for fine-grained event-level video understanding, evaluating Video-LLMs across 12 tasks and 7K videos. It highlights the challenges these models face in…

  • arxiv preprint – LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

    In this episode, we discuss LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness by Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, Xihui Liu. Recent advancements in Large Multimodal Models (LMMs) have significantly improved 2D visual understanding but 3D scene understanding has lagged due to dataset and encoder limitations. The paper introduces…

  • arxiv preprint – DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

    In this episode, we discuss DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos by Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, Ying Shan. DepthCrafter is a novel method for estimating temporally consistent depth in open-world videos without needing additional data like camera poses or optical flow. It…

  • arxiv preprint – Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

    In this episode, we discuss Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale by Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu. The paper introduces Programming Every Example (PROX), a framework that enables small language models to refine pre-training corpora by executing fine-grained operations on individual examples, outperforming traditional human-crafted…

  • arxiv preprint – Phantom of Latent for Large Language and Vision Models

    In this episode, we discuss Phantom of Latent for Large Language and Vision Models by Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro. The paper introduces Phantom, an efficient LLVM family designed to perform comparably to larger models but with significantly smaller sizes, ranging from 0.5B to 7B parameters. By temporarily…

  • arxiv preprint – Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

    In this episode, we discuss Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think by Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe. The study identifies and corrects a flaw in the inference pipeline of large diffusion models used for monocular depth estimation, achieving over 200× speed improvement…

  • arxiv preprint – On the Diagram of Thought

    In this episode, we discuss On the Diagram of Thought by Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao. Diagram of Thought (DoT) is a framework for modeling iterative reasoning in large language models (LLMs) using a directed acyclic graph (DAG) to organize propositions, critiques, refinements, and verifications. This method allows the model to navigate complex…
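
The DAG structure the episode describes can be sketched in a few lines. This is only an illustration of the bookkeeping (nodes for propositions, critiques, refinements, and verifications, with edges pointing back to what they build on), not the paper's actual implementation; the role names and class layout here are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical role set loosely following the paper's description.
ROLES = {"proposition", "critique", "refinement", "verification"}

@dataclass
class Node:
    id: int
    role: str                                      # one of ROLES
    text: str
    parents: list = field(default_factory=list)    # ids of nodes this builds on

class DiagramOfThought:
    """Minimal DAG bookkeeping: nodes are append-only, edges point backward."""
    def __init__(self):
        self.nodes = {}

    def add(self, role, text, parents=()):
        assert role in ROLES
        nid = len(self.nodes)
        self.nodes[nid] = Node(nid, role, text, list(parents))
        return nid

    def topological_order(self):
        # Edges always point to earlier ids, so insertion order
        # is already a valid topological order.
        return [self.nodes[i] for i in sorted(self.nodes)]

# Example: propose, critique, then verify.
dot = DiagramOfThought()
p = dot.add("proposition", "x = 3 solves 2x + 1 = 7")
c = dot.add("critique", "Check: 2*3 + 1 = 7, holds", parents=[p])
v = dot.add("verification", "Accepted", parents=[p, c])
print([n.role for n in dot.topological_order()])
```

Because the graph is append-only, later nodes can only reference earlier ones, which keeps the structure acyclic by construction.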

  • arxiv preprint – Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

    In this episode, we discuss Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources by Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli. The paper presents Source2Synth, a method designed to enhance Large Language Models (LLMs) by generating synthetic data with intermediate reasoning steps, grounded…

  • arxiv preprint – SongCreator: Lyrics-based Universal Song Generation

    In this episode, we discuss SongCreator: Lyrics-based Universal Song Generation by Shun Lei, Yixuan Zhou, Boshi Tang, Max W. Y. Lam, Feng Liu, Hangyu Liu, Jingcheng Wu, Shiyin Kang, Zhiyong Wu, Helen Meng. The paper introduces SongCreator, a novel song-generation system designed to create songs with both vocals and accompaniment from given lyrics. This is…

  • arxiv preprint – Achieving Human Level Competitive Robot Table Tennis

    In this episode, we discuss Achieving Human Level Competitive Robot Table Tennis by David B. D’Ambrosio, Saminda Abeyruwan, Laura Graesser, Atil Iscen, Heni Ben Amor, Alex Bewley, Barney J. Reed, Krista Reymann, Leila Takayama, Yuval Tassa, Krzysztof Choromanski, Erwin Coumans, Deepali Jain, Navdeep Jaitly, Natasha Jaques, Satoshi Kataoka, Yuheng Kuang, Nevena Lazic, Reza Mahjourian, Sherry…

  • arxiv preprint – Sapiens: Foundation for Human Vision Models

    In this episode, we discuss Sapiens: Foundation for Human Vision Models by Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito. The Sapiens model family addresses four key human-centric vision tasks and supports 1K high-resolution inference, with easy adaptability through fine-tuning on a large dataset of human images.…

  • arxiv preprint – Re-Reading Improves Reasoning in Large Language Models

    In this episode, we discuss Re-Reading Improves Reasoning in Large Language Models by Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong Long, Jian-guang Lou. The paper presents a novel prompting method called RE2 (Re-Reading) that improves the reasoning capabilities of Large Language Models by processing questions twice for better understanding. Unlike conventional…
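
The re-reading idea is simple enough to sketch as a prompt template: present the question, then present it a second time before asking for an answer. The exact wording used in the paper may differ; this phrasing is an approximation.

```python
def re2_prompt(question: str) -> str:
    """Build an RE2-style prompt: state the question, then ask the
    model to read it again before answering."""
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        "A: Let's think step by step."
    )

print(re2_prompt("If a train travels 60 miles in 1.5 hours, what is its average speed?"))
```

The appeal of the method is that it is purely a prompting change: no fine-tuning, and it composes with other strategies such as chain-of-thought.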

  • arxiv preprint – SPIRE: Semantic Prompt-Driven Image Restoration

    In this episode, we discuss SPIRE: Semantic Prompt-Driven Image Restoration by Chenyang Qi, Zhengzhong Tu, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Qifeng Chen, Hossein Talebi. The paper introduces SPIRE, a novel framework that utilizes semantic and restoration prompts to guide image restoration tasks such as denoising, super-resolution, deblurring, and compression artifact removal. Current text-driven diffusion…

  • arxiv preprint – Automated Design of Agentic Systems

    In this episode, we discuss Automated Design of Agentic Systems by Shengran Hu, Cong Lu, Jeff Clune. The paper introduces Automated Design of Agentic Systems (ADAS), which aims to replace hand-designed AI solutions with automatically created ones using a new approach where agents are defined and improved by a meta agent through programming. They propose…

  • arxiv preprint – Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    In this episode, we discuss Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model by Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy. The paper introduces Transfusion, a method for training multi-modal models using a combination of language modeling and…

  • arxiv preprint – To Code, or Not To Code? Exploring Impact of Code in Pre-training

    In this episode, we discuss To Code, or Not To Code? Exploring Impact of Code in Pre-training by Viraat Aryabumi, Yixuan Su, Raymond Ma, Adrien Morisot, Ivan Zhang, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker. In this study, the impact of incorporating code data during pre-training on various downstream tasks was systematically investigated. The…

  • arxiv preprint – Segment Anything with Multiple Modalities

    In this episode, we discuss Segment Anything with Multiple Modalities by Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu. The paper introduces MM-SAM, an extension of the Segment Anything Model (SAM) tailored for multi-modal data from various sensor suites, such as LiDAR plus RGB and thermal plus RGB. MM-SAM employs unsupervised…

  • arxiv preprint – JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

    In this episode, we discuss JPEG-LM: LLMs as Image Generators with Canonical Codec Representations by Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov. The paper introduces a novel approach for image and video generation by modeling them as compressed files using standard codecs like JPEG and AVC/H.264. Instead of pixel-based or vector quantization methods,…
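
The core representational move can be illustrated with a toy round trip: serialize the image to its compressed byte stream and treat each byte as a discrete token in a 256-entry vocabulary. None of JPEG-LM's actual tokenizer, model, or training details are reproduced here; the marker bytes below merely stand in for a real JPEG file.

```python
def bytes_to_tokens(data: bytes) -> list:
    """One token per byte, so the vocabulary size is 256."""
    return list(data)

def tokens_to_bytes(tokens) -> bytes:
    """Inverse mapping: a generated token sequence is just a file."""
    return bytes(tokens)

# A tiny stand-in payload: JPEG SOI (\xff\xd8) and EOI (\xff\xd9) markers.
payload = b"\xff\xd8\xff\xe0" + b"\x00" * 8 + b"\xff\xd9"
toks = bytes_to_tokens(payload)
assert tokens_to_bytes(toks) == payload
print(len(toks), toks[:4])
```

Because the mapping is lossless and the vocabulary is fixed, any language model that emits a valid byte sequence directly emits a decodable file.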

  • arxiv preprint – Mission: Impossible Language Models

    In this episode, we discuss Mission: Impossible Language Models by Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, Christopher Potts. The paper investigates the claim, made by Chomsky and others, that large language models (LLMs) are equally capable of learning possible and impossible languages, by designing synthetic impossible languages with unnatural word orders and grammar rules. Experiments conducted using GPT-2 small models…

  • arxiv preprint – Learning Task Decomposition to Assist Humans in Competitive Programming

    In this episode, we discuss Learning Task Decomposition to Assist Humans in Competitive Programming by Jiaxin Wen, Ruiqi Zhong, Pei Ke, Zhihong Shao, Hongning Wang, Minlie Huang. The paper presents a method to enhance human understanding and repair of language model (LM)-generated solutions by automatically breaking down complex solutions into simpler subtasks. They introduce a…

  • arxiv preprint – IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts

    In this episode, we discuss IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts by Ciara Rowles, Shimon Vainer, Dante De Nigris, Slava Elizarov, Konstantin Kutsy, Simon Donné. The paper discusses IPAdapter-Instruct, a method combining natural-image conditioning with “Instruct” prompts to enable nuanced control over image generation. This approach allows for multiple interpretations (like style…

  • arxiv preprint – Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    In this episode, we discuss Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters by Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar. The paper explores the impact of increased inference-time computation on Large Language Models (LLMs) to enhance their performance on challenging prompts. It examines two primary methods for scaling…
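
One of the simplest ways to spend extra test-time compute is best-of-N sampling: draw several candidate answers and keep the highest-scoring one. The sketch below uses toy stand-ins for the sampled LLM call and the verifier/reward model; the paper studies these and more adaptive strategies in far more detail.

```python
import random

def best_of_n(generate, score, prompt, n=8, seed=0):
    """Draw n candidates from `generate` and return the one that
    `score` (a stand-in for a verifier or reward model) ranks highest."""
    random.seed(seed)
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: "answers" are integers, the scorer prefers values near 42.
gen = lambda _prompt: random.randint(0, 100)
scorer = lambda ans: -abs(ans - 42)
print(best_of_n(gen, scorer, "What is 6 * 7?", n=16))
```

The trade-off the paper examines is exactly this knob: for a fixed budget, is it better to raise n (more inference compute per query) or to use a bigger model with n = 1?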

  • arxiv preprint – Language Model Can Listen While Speaking

    In this episode, we discuss Language Model Can Listen While Speaking by Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen. The paper explores enhancing real-time interaction in speech-based conversational AI by introducing the listening-while-speaking language model (LSLM) for full-duplex communication. LSLM integrates simultaneous listening and speaking capabilities…

  • arxiv preprint – Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning

    In this episode, we discuss Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning by Trapoom Ukarapol, Zhicheng Lee, Amy Xin. The paper investigates enhancing smaller language models, like MiniCPM, through improved text embeddings via contrastive fine-tuning on the NLI dataset. Results indicate that this fine-tuning significantly improves performance across multiple benchmarks, with MiniCPM…
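
Contrastive fine-tuning of text embeddings typically optimizes an InfoNCE-style objective: pull each anchor toward its paired positive while pushing it away from the other in-batch examples. The sketch below shows that loss in NumPy; the paper's exact loss, temperature, and batching may differ.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style contrastive loss with in-batch negatives.
    anchors, positives: (batch, dim) embedding arrays; row i of
    `positives` is the positive pair for row i of `anchors`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (batch, batch) cosine sims
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # match i-th anchor to i-th positive

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
print(info_nce_loss(a, a + 0.01 * rng.normal(size=(4, 8))))
```

When each positive is nearly identical to its anchor, the diagonal dominates the similarity matrix and the loss approaches zero, which is the behavior fine-tuning drives toward on genuine paraphrase/entailment pairs.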