Arxiv paper – A Preliminary Study for GPT-4o on Image Restoration
In this episode, we discuss A Preliminary Study for GPT-4o on Image Restoration by Hao Yang, Yan Yang, Ruikun Zhang, Liyuan Pan. This paper presents the first comprehensive evaluation of OpenAI’s GPT-4o model on various image restoration tasks, revealing that while its outputs are visually appealing, they often lack pixel-level structural accuracy. The authors demonstrate…
-
Arxiv paper – DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion
In this episode, we discuss DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion by Qitao Zhao, Amy Lin, Jeff Tan, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani. The paper introduces DiffusionSfM, a novel data-driven framework that directly infers 3D scene geometry and camera poses from multi-view images using a transformer-based denoising diffusion…
-
Arxiv paper – RayZer: A Self-supervised Large View Synthesis Model
In this episode, we discuss RayZer: A Self-supervised Large View Synthesis Model by Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, Georgios Pavlakos. RayZer is a self-supervised multi-view 3D vision model that learns 3D scene understanding without any 3D supervision, including camera poses…
-
Arxiv paper – Reinforcement Learning for Reasoning in Large Language Models with One Training Example
In this episode, we discuss Reinforcement Learning for Reasoning in Large Language Models with One Training Example by Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen. The paper demonstrates that reinforcement learning with verifiable…
-
Arxiv paper – MINERVA: Evaluating Complex Video Reasoning
In this episode, we discuss MINERVA: Evaluating Complex Video Reasoning by Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand. The paper introduces MINERVA, a new video reasoning dataset featuring complex multi-step questions with detailed reasoning traces to evaluate multimodal…
-
Arxiv paper – The Leaderboard Illusion
In this episode, we discuss The Leaderboard Illusion by Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker. The paper reveals that Chatbot Arena’s leaderboard rankings are biased due to undisclosed private testing, allowing some providers to selectively…
-
Arxiv paper – Towards Understanding Camera Motions in Any Video
In this episode, we discuss Towards Understanding Camera Motions in Any Video by Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan. The paper presents CameraBench, a large-scale, expertly annotated video dataset and benchmark…
-
Arxiv paper – Describe Anything: Detailed Localized Image and Video Captioning
In this episode, we discuss Describe Anything: Detailed Localized Image and Video Captioning by Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui. The paper presents the Describe Anything Model (DAM) for detailed localized captioning that integrates local detail and global context…
-
Arxiv paper – MCNC: Manifold-Constrained Reparameterization for Neural Compression
In this episode, we discuss MCNC: Manifold-Constrained Reparameterization for Neural Compression by Chayne Thrash, Ali Abbasi, Reed Andreas, Parsa Nooralinejad, Soroush Abbasi Koohpayegani, Hamed Pirsiavash, Soheil Kolouri. The paper introduces Manifold-Constrained Neural Compression (MCNC), a novel model compression technique that confines parameters…
-
Arxiv paper – Self-Improving Robust Preference Optimization
In this episode, we discuss Self-Improving Robust Preference Optimization by Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar. The paper introduces Self-Improving Robust Preference Optimization (SRPO), an offline RLHF framework that enables models to self-improve and generalize across tasks by jointly optimizing a self-improvement and generative policy through a min-max objective. SRPO…
-
Arxiv paper – LLM Post-Training: A Deep Dive into Reasoning Large Language Models
In this episode, we discuss LLM Post-Training: A Deep Dive into Reasoning Large Language Models by Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Philip H. S. Torr, Fahad Shahbaz Khan, Salman Khan. The paper surveys post-training techniques for Large Language Models (LLMs) that enhance performance beyond initial…
-
Arxiv paper – Welcome to the Era of Experience
In this episode, we discuss Welcome to the Era of Experience by David Silver, Richard S. Sutton. The paper discusses the forthcoming era of artificial intelligence, marked by agents with superhuman capabilities that will learn primarily through experience, and highlights the essential features that will characterize this new phase in AI development.
-
Arxiv paper – MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation
In this episode, we discuss MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation by Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, José Lezama, Irfan Essa, David Ross, Jonathan Huang. The paper introduces MALT Diffusion, a new diffusion model designed for generating long videos by dividing them into short segments and using…
-
Arxiv paper – InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
In this episode, we discuss InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models by Jinguo Zhu, Weiyun Wang, Zhe Chen, et al. InternVL3 advances the InternVL series by jointly training…
-
Arxiv paper – EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise
In this episode, we discuss EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise by Chao Liu and Arash Vahdat. The paper presents a video diffusion framework that utilizes temporally consistent noise to generate coherent and high-quality video frames without needing specialized modules. By ensuring the model handles…
-
Arxiv paper – TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
In this episode, we discuss TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning by Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang. The paper introduces TinyLLaVA-Video-R1, a small-scale video reasoning model with no more than 4 billion parameters, designed to enhance reasoning abilities using reinforcement learning on general Video-QA datasets. Unlike previous studies that focus on…
-
Arxiv paper – Reasoning Models Don’t Always Say What They Think
In this episode, we discuss Reasoning Models Don’t Always Say What They Think by Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, et al.…
-
Arxiv paper – Slow-Fast Architecture for Video Multi-Modal Large Language Models
In this episode, we discuss Slow-Fast Architecture for Video Multi-Modal Large Language Models by Min Shi, Shihao Wang, Chieh-Yun Chen, Jitesh Jain, Kai Wang, Junjun Xiong, Guilin Liu, Zhiding Yu, Humphrey Shi. The paper presents a slow-fast architecture for video-based multi-modal large language models that uses a dual-token system to balance temporal resolution and spatial…
-
Arxiv paper – TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes
In this episode, we discuss TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes by Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, Ying Tai. The paper addresses Complex Visual Text Generation (CVTG), which involves creating detailed textual content within images but often suffers from issues like distortion and…
-
Arxiv paper – VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
In this episode, we discuss VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning by Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou. The paper introduces VideoMind, a novel video-language agent designed for precise temporal-grounded video understanding. It employs a role-based workflow with components like a planner, grounder, verifier, and answerer, integrated efficiently…
-
Arxiv paper – SynCity: Training-Free Generation of 3D Worlds
In this episode, we discuss SynCity: Training-Free Generation of 3D Worlds by Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi. The paper presents SynCity, a novel method for generating expansive 3D worlds directly from textual descriptions without requiring additional training or optimization. SynCity combines the geometric accuracy of pre-trained 3D generative models with…
-
Arxiv paper – HD-EPIC: A Highly-Detailed Egocentric Video Dataset
In this episode, we discuss HD-EPIC: A Highly-Detailed Egocentric Video Dataset by Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen. The paper introduces HD-EPIC, a…
-
Arxiv paper – Video-T1: Test-Time Scaling for Video Generation
In this episode, we discuss Video-T1: Test-Time Scaling for Video Generation by Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan. The paper investigates Test-Time Scaling (TTS) for video generation, aiming to enhance video quality by leveraging additional inference-time computation instead of expanding model size or training data. The authors treat video…
-
Arxiv paper – Calibrated Multi-Preference Optimization for Aligning Diffusion Models
In this episode, we discuss Calibrated Multi-Preference Optimization for Aligning Diffusion Models by Kyungmin Lee, Xiaohang Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, Yinxiao Li. The paper introduces Calibrated Preference Optimization (CaPO), a new method for aligning text-to-image diffusion models using multiple reward models without requiring expensive…
-
Arxiv paper – Personalize Anything for Free with Diffusion Transformer
In this episode, we discuss Personalize Anything for Free with Diffusion Transformer by Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng. The paper introduces *Personalize Anything*, a training-free framework for personalized image generation using diffusion transformers (DiTs). By replacing denoising tokens with those of a reference subject, the method enables zero-shot subject reconstruction…