Category: Uncategorized

Arxiv paper – DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

In this episode, we discuss DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by DeepSeek-AI. The paper introduces DeepSeek-R1-Zero, a reasoning model trained solely with large-scale reinforcement learning, which exhibits strong reasoning abilities but struggles with readability and language mixing. To overcome these limitations, the authors developed DeepSeek-R1 by adding multi-stage training and cold-start…

January 27, 2025
Arxiv paper – Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step

In this episode, we discuss Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step by Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng. The paper investigates the use of Chain-of-Thought (CoT) reasoning to improve autoregressive image generation through techniques like test-time computation scaling,…

January 24, 2025
Arxiv paper – Improving Factuality with Explicit Working Memory

In this episode, we discuss Improving Factuality with Explicit Working Memory by Mingda Chen, Yang Li, Karthik Padthe, Rulin Shao, Alicia Sun, Luke Zettlemoyer, Gargi Gosh, Wen-tau Yih. The paper presents Ewe, a novel method that incorporates explicit working memory into large language models to improve factuality in long-form text generation by updating memory in…

January 23, 2025
Arxiv paper – Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

In this episode, we discuss Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control by Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu. The paper introduces “Diffusion as Shader” (DaS), a novel approach that supports various video…

January 17, 2025
Arxiv paper – FaceLift: Single Image to 3D Head with View Generation and GS-LRM

In this episode, we discuss FaceLift: Single Image to 3D Head with View Generation and GS-LRM by Weijie Lyu, Yi Zhou, Ming-Hsuan Yang, Zhixin Shu. FaceLift is a feed-forward approach for rapid and high-quality 360-degree head reconstruction using a single image, utilizing a multi-view latent diffusion model followed by a GS-LRM reconstructor to create 3D…

January 13, 2025
Arxiv paper – GenHMR: Generative Human Mesh Recovery

In this episode, we discuss GenHMR: Generative Human Mesh Recovery by Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen. The paper introduces GenHMR, a novel generative framework for human mesh recovery (HMR) that addresses uncertainties in converting 2D images to 3D mesh. It employs a pose tokenizer and an image-conditional…

January 8, 2025
Arxiv paper – Video Creation by Demonstration

In this episode, we discuss Video Creation by Demonstration by Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu. The paper introduces Video Creation by Demonstration, utilizing a method called 𝛿-Diffusion to generate videos that smoothly continue from a given context image, integrating…

January 6, 2025
Arxiv paper – Byte Latent Transformer: Patches Scale Better Than Tokens

In this episode, we discuss Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer. The Byte Latent Transformer (BLT) presents a novel approach to large language models…

January 2, 2025
Arxiv paper – Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

In this episode, we discuss Align3R: Aligned Monocular Depth Estimation for Dynamic Videos by Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu. Align3R is introduced as a method for achieving temporally consistent depth maps in videos using monocular inputs, addressing the challenge of…

December 17, 2024
Arxiv paper – FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

In this episode, we discuss FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion by Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu. The paper introduces FreeScale, a tuning-free inference method that enhances visual diffusion models’ ability to generate high-resolution images by combining data from…

December 16, 2024
Arxiv paper – ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

In this episode, we discuss ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis by Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, Yonghong Tian. ViewCrafter introduces a new method for synthesizing high-fidelity novel views from single or sparse images, using video diffusion models…

December 11, 2024
Arxiv paper – o1-Coder: an o1 Replication for Coding

In this episode, we discuss o1-Coder: an o1 Replication for Coding by Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, Jitao Sang. The paper discusses “O1-CODER,” which aims to replicate OpenAI’s o1 model focusing on coding tasks, utilizing reinforcement learning and Monte Carlo Tree Search to boost System-2 thinking. The framework…

December 10, 2024
Arxiv paper – DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

In this episode, we discuss DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning by Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar. DigiRL is an innovative autonomous reinforcement learning approach designed to train device control agents by refining pre-trained vision language models through a two-stage process involving offline…

December 6, 2024
ICLR 2025 submission – CYCLE-CONSISTENT LEARNING FOR JOINT LAYOUT-TO-IMAGE GENERATION AND OBJECT DETECTION

In this episode, we discuss CYCLE-CONSISTENT LEARNING FOR JOINT LAYOUT-TO-IMAGE GENERATION AND OBJECT DETECTION by The paper’s authors are listed as “Anonymous authors” since it is under double-blind review.. The paper introduces a new generation-detection cycle consistent (GDCC) learning framework that simultaneously optimizes layout-to-image generation and object detection, highlighting the inherent duality of these tasks.…

December 2, 2024
Arxiv Paper – WonderWorld: Interactive 3D Scene Generation from a Single Image

In this episode, we discuss WonderWorld: Interactive 3D Scene Generation from a Single Image by Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, Jiajun Wu. WonderWorld is an innovative framework designed for rapid, interactive 3D scene generation, allowing users to specify and view scene contents and layouts with minimal delay. The primary challenge addressed…

November 25, 2024
Arxiv Paper – Hymba: A Hybrid-head Architecture for Small Language Models

In this episode, we discuss Hymba: A Hybrid-head Architecture for Small Language Models by Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov. The paper introduces Hymba, a new family of small language models that combines…

November 22, 2024
Arxiv Paper – Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

In this episode, we discuss Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation by Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt. The paper highlights security risks in black-box finetuning interfaces for large language models and introduces covert malicious finetuning, a method to compromise a model’s safety undetected. This involves…

November 21, 2024
Arxiv Paper – Video Instruction Tuning With Synthetic Data

In this episode, we discuss Video Instruction Tuning With Synthetic Data by Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li. The paper proposes a high-quality synthetic dataset, LLaVA-Video-178K, to address the challenge of developing large multimodal video models by improving video instruction-following tasks through detailed captioning and question-answering. Using…

November 19, 2024
Arxiv Paper – Generative Agent Simulations of 1,000 People

In this episode, we discuss Generative Agent Simulations of 1,000 People by Joon Sung Park, Carolyn Q. Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, Michael S. Bernstein. The paper introduces a new agent architecture that simulates the behaviors and attitudes of over 1,000 individuals using large language…

November 19, 2024
NeurIPS 2024 – Moving Off-the-Grid: Scene-Grounded Video Representations

In this episode, we discuss Moving Off-the-Grid: Scene-Grounded Video Representations by Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew A. Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, Thomas Kipf. The paper introduces the Moving Off-the-Grid (MooG)…

November 15, 2024
Arxiv Paper – Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution

In this episode, we discuss Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution by Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin.…

November 14, 2024
Arxiv Paper – FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

In this episode, we discuss FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality by Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, Kwan-Yee K. Wong. FasterCache is introduced as a training-free approach that accelerates inference in video diffusion models by reusing features more efficiently, maintaining high video quality. The strategy…

November 12, 2024
Arxiv Paper – Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

In this episode, we discuss Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA by Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster. The paper presents methods to transform large language models into smaller, efficient “Recursive Transformers” by using parameter sharing through revisiting “layer tying”, which reduces model size and cost…

November 11, 2024
Arxiv Paper – Long Context RAG Performance of Large Language Models

In this episode, we discuss Long Context RAG Performance of Large Language Models by Quinn Leng, Jacob Portes, Sam Havens, Matei Zaharia, Michael Carbin. The paper examines the effects of long context lengths on Retrieval Augmented Generation (RAG) in large language models, especially with models supporting contexts over 64k tokens like Anthropic Claude and GPT-4-turbo.…

November 8, 2024
Arxiv Paper – NVLM: Open Frontier-Class Multimodal LLMs

In this episode, we discuss NVLM: Open Frontier-Class Multimodal LLMs by Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping. The paper introduces NVLM 1.0, a set of advanced multimodal large language models that achieve state-of-the-art performance on vision-language tasks and improve upon their…

November 4, 2024