Arxiv paper – VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
In this episode, we discuss VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning by Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou. The paper introduces VideoMind, a novel video-language agent designed for precise temporal-grounded video understanding. It employs a role-based workflow with components like a planner, grounder, verifier, and answerer, integrated efficiently…
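The planner/grounder/verifier/answerer workflow can be pictured as a small pipeline of role functions. The sketch below is a toy illustration of that idea only; the function bodies, the keyword-matching grounder, and the list-of-segments "video" representation are all assumptions for demonstration, not VideoMind's actual components.

```python
# Toy sketch of a role-based video-reasoning pipeline in the spirit of a
# planner -> grounder -> verifier -> answerer workflow. All logic here is
# an illustrative stand-in, not the paper's models.

def planner(question):
    # Decide which roles to run; this toy planner always uses all three.
    return ["grounder", "verifier", "answerer"]

def grounder(video, question):
    # Temporal grounding: localize the segment relevant to the question
    # (toy version: first segment whose caption shares a question word).
    for start, end, caption in video:
        if any(word in caption for word in question.lower().split()):
            return (start, end)
    return (0, video[-1][1])  # fall back to the whole video

def verifier(video, span):
    # Accept the span only if it fully covers at least one segment.
    return any(s >= span[0] and e <= span[1] for s, e, _ in video)

def answerer(video, span):
    # Answer using only captions inside the grounded moment.
    return " / ".join(c for s, e, c in video if s >= span[0] and e <= span[1])

video = [(0, 10, "a chef chops onions"), (10, 20, "the chef fries the onions")]
question = "when are the onions chopped"
pipeline = planner(question)                  # ["grounder", "verifier", "answerer"]
span = grounder(video, question)              # (0, 10)
answer = answerer(video, span) if verifier(video, span) else "unverified"
print(span, answer)
```

The point of the role split is that grounding, checking, and answering are separate, composable steps rather than one monolithic query over the whole video.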
-
Arxiv paper – SynCity: Training-Free Generation of 3D Worlds
In this episode, we discuss SynCity: Training-Free Generation of 3D Worlds by Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi. The paper presents SynCity, a novel method for generating expansive 3D worlds directly from textual descriptions without requiring additional training or optimization. SynCity combines the geometric accuracy of pre-trained 3D generative models with…
-
Arxiv paper – HD-EPIC: A Highly-Detailed Egocentric Video Dataset
In this episode, we discuss HD-EPIC: A Highly-Detailed Egocentric Video Dataset by Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen. The paper introduces HD-EPIC, a…
-
Arxiv paper – Video-T1: Test-Time Scaling for Video Generation
In this episode, we discuss Video-T1: Test-Time Scaling for Video Generation by Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan. The paper investigates Test-Time Scaling (TTS) for video generation, aiming to enhance video quality by leveraging additional inference-time computation instead of expanding model size or training data. The authors treat video…
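One common instantiation of test-time scaling is best-of-N selection: sample several candidates and keep the one a scoring function prefers. The sketch below illustrates only that general idea; the random "generator" and identity "scorer" are toy stand-ins, not Video-T1's models.

```python
# Best-of-N sketch of test-time scaling: spend more inference compute by
# sampling more candidates and keeping the best-scoring one. Generator and
# scorer below are illustrative assumptions, not the paper's components.
import random

def generate_candidate(rng):
    # Toy stand-in for one sampled generation (a random quality value).
    return rng.random()

def score(candidate):
    # Toy stand-in for a verifier / reward model.
    return candidate

def best_of_n(n, seed=0):
    rng = random.Random(seed)
    candidates = [generate_candidate(rng) for _ in range(n)]
    return max(candidates, key=score)

# With a fixed seed, a larger candidate budget can only match or improve
# the selected score, which is the core trade-off TTS exploits.
print(best_of_n(1), best_of_n(16))
```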
-
Arxiv paper – Calibrated Multi-Preference Optimization for Aligning Diffusion Models
In this episode, we discuss Calibrated Multi-Preference Optimization for Aligning Diffusion Models by Kyungmin Lee, Xiaohang Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, Yinxiao Li. The paper introduces Calibrated Preference Optimization (CaPO), a new method for aligning text-to-image diffusion models using multiple reward models without requiring expensive…
-
Arxiv paper – Personalize Anything for Free with Diffusion Transformer
In this episode, we discuss Personalize Anything for Free with Diffusion Transformer by Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng. The paper introduces *Personalize Anything*, a training-free framework for personalized image generation using diffusion transformers (DiTs). By replacing denoising tokens with those of a reference subject, the method enables zero-shot subject reconstruction…
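The token-replacement idea in the blurb can be sketched in miniature: wherever a mask marks the subject's positions, the scene's denoising tokens are swapped for the reference subject's tokens. The 1-D string "tokens", the mask, and the single replacement step below are illustrative assumptions, not the paper's DiT internals.

```python
# Toy sketch of subject-token replacement during denoising: masked positions
# take the reference subject's tokens, the rest keep the scene's own tokens.
# All names and the 1-D token lists are illustrative assumptions.

def replace_subject_tokens(denoising_tokens, reference_tokens, mask):
    # Positions flagged by the mask adopt the reference token; others are kept.
    return [ref if m else tok
            for tok, ref, m in zip(denoising_tokens, reference_tokens, mask)]

scene = ["sky", "sky", "noise", "noise", "grass"]
reference = ["-", "-", "cat_head", "cat_body", "-"]
mask = [False, False, True, True, False]
print(replace_subject_tokens(scene, reference, mask))
# → ['sky', 'sky', 'cat_head', 'cat_body', 'grass']
```

Because the swap happens inside the denoising process rather than through fine-tuning, the approach is training-free: the subject's identity is injected directly into the latent tokens.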
-
Arxiv paper – Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
In this episode, we discuss Story-Adapter: A Training-free Iterative Framework for Long Story Visualization by Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Yuyin Zhou. The paper tackles the challenge of generating coherent image sequences for long narratives using text-to-image diffusion models. It introduces Story-Adapter, a training-free and efficient framework that…
-
Arxiv paper – ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
In this episode, we discuss ReCamMaster: Camera-Controlled Generative Rendering from A Single Video by Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang. ReCamMaster is a generative framework that modifies camera trajectories in existing videos by re-rendering scenes from new perspectives. It…
-
Arxiv paper – Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
In this episode, we discuss Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models by Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, Shaohui Lin. The paper aims to enhance the reasoning abilities of Multimodal Large Language Models (MLLMs) using reinforcement learning (RL). To overcome the lack…
-
Arxiv paper – MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
In this episode, we discuss MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks by Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen. The paper introduces MEGA-BENCH, a comprehensive evaluation suite…
-
Arxiv paper – TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
In this episode, we discuss TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models by Mark YU, Wenbo Hu, Jinbo Xing, Ying Shan. TrajectoryCrafter is a new method that precisely redirects camera paths in monocular videos by separating view changes from content generation. It uses a dual-stream conditional video diffusion model that combines point…
-
Arxiv paper – PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
In this episode, we discuss PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving by Mihir Parmar, Xin Liu, Palash Goyal, Yanfei Chen, Long Le, Swaroop Mishra, Hossein Mobahi, Jindong Gu, Zifeng Wang, Hootan Nakhost, Chitta Baral, Chen-Yu Lee, Tomas Pfister, Hamid Palangi. The paper introduces **PlanGEN**, a versatile agent…
-
Arxiv paper – VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
In this episode, we discuss VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing by Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang. The paper introduces VideoGrain, a zero-shot method that enhances multi-grained video editing by modulating space-time attention mechanisms for class-, instance-, and part-level modifications. It addresses challenges like semantic misalignment and feature coupling by…
-
Arxiv paper – ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
In this episode, we discuss ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models by Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin,…
-
Arxiv paper – Teaching Language Models to Critique via Reinforcement Learning
In this episode, we discuss Teaching Language Models to Critique via Reinforcement Learning by Zhihui Xie, Jie chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong. The paper presents CTRL, a framework that uses reinforcement learning to train critic models which provide feedback for improving code generated by large language models without needing human input.…
-
Arxiv paper – PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
In this episode, we discuss PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling by Avery Ma, Yangchen Pan, Amir-massoud Farahmand. The paper introduces PANDAS, a hybrid technique that enhances many-shot jailbreaking by altering fabricated dialogues with positive affirmations, negative demonstrations, and optimized adaptive sampling tailored to specific prompts. Experimental results on…
-
Arxiv paper – VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
In this episode, we discuss VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation by Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu. The paper presents VidCRAFT3, a new framework for image-to-video generation that allows simultaneous control over camera motion, object movement, and lighting direction. It addresses previous limitations…
-
Arxiv paper – Heuristically Adaptive Diffusion-Model Evolutionary Strategy
In this episode, we discuss Heuristically Adaptive Diffusion-Model Evolutionary Strategy by Benedikt Hartl, Yanbo Zhang, Hananel Hazan, Michael Levin. The paper explores the connection between diffusion models and evolutionary algorithms, highlighting that both generate high-quality samples through iterative refinement of random initial states. By integrating deep learning-based diffusion models into evolutionary processes, the authors enhance…
-
Arxiv paper – Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
In this episode, we discuss Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach by Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein. The paper presents a new language model architecture that enhances test-time computation by iteratively reasoning in latent space using…
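Recurrent-depth scaling can be caricatured as unrolling one small update rule more times at inference: extra test-time compute refines a latent state instead of emitting more tokens. The linear contraction below is an illustrative assumption, not the paper's architecture.

```python
# Toy sketch of recurrent-depth latent reasoning: the same block is applied
# to a latent state repeatedly, so depth (and compute) is chosen at test time.
# The 0.5/0.5 contraction toward the context is an illustrative assumption.

def recurrent_refine(latent, context, steps):
    # More unrolled steps = more test-time compute on the same weights.
    for _ in range(steps):
        latent = 0.5 * latent + 0.5 * context
    return latent

shallow = recurrent_refine(0.0, 1.0, steps=2)   # 0.75
deep = recurrent_refine(0.0, 1.0, steps=10)     # ≈ 0.999
print(shallow, deep)
```

Deeper unrolling drives the latent closer to its fixed point, which is the sense in which iteration depth substitutes for parameter count or longer chains of thought.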
-
Arxiv paper – EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
In this episode, we discuss EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang. The paper presents **EMBODIEDBENCH**, a comprehensive benchmark with 1,128 tasks across…
-
Arxiv paper – VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
In this episode, we discuss VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection by Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu. The paper introduces VideoEspresso, a high-quality, large-scale VideoQA dataset that maintains essential spatial and temporal details…
-
Arxiv paper – VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
In this episode, we discuss VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models by Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin. Generative video models typically prioritize appearance accuracy over motion coherence, limiting their ability to capture realistic dynamics. The paper presents VideoJAM, a…
-
Arxiv paper – HunyuanVideo: A Systematic Framework For Large Video Generative Models
In this episode, we discuss HunyuanVideo: A Systematic Framework For Large Video Generative Models by Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan,…
-
Arxiv paper – s1: Simple test-time scaling
In this episode, we discuss s1: Simple test-time scaling by Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto. The paper introduces a straightforward method for test-time scaling in language models to enhance reasoning performance by utilizing additional computational resources during inference. The…
-
Arxiv paper – Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
In this episode, we discuss Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation by the **Hunyuan3D Team** (individual contributor names are listed at the end of the full report). Hunyuan3D 2.0 is a large-scale 3D synthesis system featuring Hunyuan3D-DiT for generating detailed…