-
Arxiv paper – Heuristically Adaptive Diffusion-Model Evolutionary Strategy
In this episode, we discuss Heuristically Adaptive Diffusion-Model Evolutionary Strategy by Benedikt Hartl, Yanbo Zhang, Hananel Hazan, Michael Levin. The paper explores the connection between diffusion models and evolutionary algorithms, highlighting that both generate high-quality samples through iterative refinement of random initial states. By integrating deep learning-based diffusion models into evolutionary processes, the authors enhance…
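The analogy is concrete enough to sketch in a few lines. Below is a minimal toy illustration (not the authors' code; every name, objective, and update rule here is an assumption) of the shared pattern the episode points to: a diffusion-style sampler and an evolutionary strategy both start from random states and refine them iteratively.

```python
# Toy illustration of the diffusion / evolutionary-strategy analogy.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])           # toy "data mode" / optimum

def score(x):
    # Higher is better; plays the role of log-density / fitness.
    return -np.sum((x - target) ** 2)

# --- Diffusion-style refinement: a noisy state is nudged toward the target ---
x = rng.normal(size=3)                        # random initial state
for t in range(50):
    alpha = 0.1                               # step size toward the (toy) score direction
    noise_scale = 0.3 * (1 - t / 50)          # noise anneals over the steps
    x = x + alpha * (target - x) + noise_scale * rng.normal(size=3)

# --- (mu, lambda)-ES refinement: population of noisy candidates, keep the best ---
pop = rng.normal(size=(16, 3))                # random initial population
for gen in range(50):
    elite = pop[np.argsort([score(p) for p in pop])[-4:]]    # top mu = 4
    mean = elite.mean(axis=0)
    pop = mean + 0.3 * rng.normal(size=(16, 3))              # lambda = 16 offspring

print("diffusion-style result:", np.round(x, 2))
print("ES-style result:       ", np.round(pop.mean(axis=0), 2))
```

Both loops converge near the same point; the paper's contribution, as discussed in the episode, is to make that correspondence explicit by using a learned diffusion model inside the evolutionary process.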
-
Arxiv paper – Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
In this episode, we discuss Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach by Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein. The paper presents a new language model architecture that enhances test-time computation by iteratively reasoning in latent space using…
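As a rough sketch of the recurrent-depth idea (an assumption about the general mechanism, not the paper's actual architecture), the same block can be applied to a latent state a variable number of times, so test-time compute scales with the iteration count rather than with parameter count. Names, shapes, and the update rule below are hypothetical.

```python
# Sketch: iterate one shared block in latent space; more iterations = more test-time compute.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # latent width (hypothetical)
W = rng.normal(scale=0.1, size=(d, d))    # shared recurrent block weights
U = rng.normal(scale=0.1, size=(d, d))    # projection of the input embedding

def recurrent_depth_forward(x_embed, num_iters):
    """Apply the same block to the latent state num_iters times."""
    s = np.zeros(d)                        # latent state
    for _ in range(num_iters):
        s = np.tanh(W @ s + U @ x_embed)   # identical weights reused at every step
    return s

x = rng.normal(size=d)                     # stand-in for a token embedding
shallow = recurrent_depth_forward(x, num_iters=4)
deep = recurrent_depth_forward(x, num_iters=64)   # scaled-up test-time budget
print("latent shift between budgets:", round(float(np.linalg.norm(deep - shallow)), 3))
```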
-
Arxiv paper – EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
In this episode, we discuss EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang. The paper presents **EMBODIEDBENCH**, a comprehensive benchmark with 1,128 tasks across…
-
Arxiv paper – VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
In this episode, we discuss VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection by Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu. The paper introduces VideoEspresso, a high-quality, large-scale VideoQA dataset that maintains essential spatial and temporal details…
-
Arxiv paper – VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
In this episode, we discuss VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models by Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin. Generative video models typically prioritize appearance accuracy over motion coherence, limiting their ability to capture realistic dynamics. The paper presents VideoJAM, a…
-
Arxiv paper – HunyuanVideo: A Systematic Framework For Large Video Generative Models
In this episode, we discuss HunyuanVideo: A Systematic Framework For Large Video Generative Models by Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan,…
-
Arxiv paper – s1: Simple test-time scaling
In this episode, we discuss s1: Simple test-time scaling by Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto. The paper introduces a straightforward method for test-time scaling in language models to enhance reasoning performance by utilizing additional computational resources during inference. The…
-
Arxiv paper – Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
In this episode, we discuss Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation by the **Hunyuan3D Team** (individual contributors are listed at the end of the full report). Hunyuan3D 2.0 is a large-scale 3D synthesis system featuring Hunyuan3D-DiT for generating detailed…
-
Arxiv paper – MatAnyone: Stable Video Matting with Consistent Memory Propagation
In this episode, we discuss MatAnyone: Stable Video Matting with Consistent Memory Propagation by Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, Chen Change Loy. The paper introduces **MatAnyone**, a robust framework for target-assigned video matting that overcomes challenges posed by complex or ambiguous backgrounds without relying on auxiliary inputs. It employs a memory-based approach…
-
Arxiv paper – Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
In this episode, we discuss Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate by Yubo Wang, Xiang Yue, Wenhu Chen. The paper introduces Critique Fine-Tuning (CFT), a novel approach where language models are trained to critique noisy responses instead of simply imitating correct ones, inspired by human critical thinking. Using a…
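A minimal sketch of how a critique-style training example differs from a standard imitation example (an illustration of the idea, not the authors' data pipeline; field names and formatting are hypothetical):

```python
# Contrast: SFT imitates a reference answer; CFT critiques a noisy response.
question = "What is 17 * 24?"
noisy_response = "17 * 24 = 398"                    # flawed response to be critiqued
reference_critique = (
    "The multiplication is wrong: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408, "
    "not 398. The final answer should be 408."
)

# Standard SFT pair: learn to reproduce the correct answer.
sft_example = {"input": question, "target": "408"}

# CFT pair: given the question plus a noisy response, learn to produce the critique.
cft_example = {
    "input": f"Question: {question}\nResponse: {noisy_response}\nCritique the response.",
    "target": reference_critique,
}

for name, ex in [("SFT", sft_example), ("CFT", cft_example)]:
    print(f"--- {name} ---\ninput:  {ex['input']}\ntarget: {ex['target']}\n")
```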
-
Arxiv paper – Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
In this episode, we discuss Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs by Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu. The paper identifies “underthinking” in large language models like…
-
Arxiv paper – MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
In this episode, we discuss MetaMorph: Multimodal Understanding and Generation via Instruction Tuning by Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu. The paper introduces Visual-Predictive Instruction Tuning (VPiT), which enhances pretrained large language models to generate both text and visual tokens by…
-
Arxiv paper – Improving Video Generation with Human Feedback
In this episode, we discuss Improving Video Generation with Human Feedback by Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang. The paper introduces a pipeline that utilizes…
-
Arxiv paper – Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
In this episode, we discuss Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling by Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan. The paper introduces Janus-Pro, an enhanced version of…
-
Arxiv paper – DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
In this episode, we discuss DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by DeepSeek-AI. The paper introduces DeepSeek-R1-Zero, a reasoning model trained solely with large-scale reinforcement learning, which exhibits strong reasoning abilities but struggles with readability and language mixing. To overcome these limitations, the authors developed DeepSeek-R1 by adding multi-stage training and cold-start…
-
Arxiv paper – Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
In this episode, we discuss Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step by Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng. The paper investigates the use of Chain-of-Thought (CoT) reasoning to improve autoregressive image generation through techniques like test-time computation scaling,…
-
Arxiv paper – Improving Factuality with Explicit Working Memory
In this episode, we discuss Improving Factuality with Explicit Working Memory by Mingda Chen, Yang Li, Karthik Padthe, Rulin Shao, Alicia Sun, Luke Zettlemoyer, Gargi Ghosh, Wen-tau Yih. The paper presents Ewe, a novel method that incorporates explicit working memory into large language models to improve factuality in long-form text generation by updating memory in…
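As a rough sketch of the general control flow (an assumption, not the Ewe implementation), a working-memory-augmented generator keeps a small set of memory entries, conditions each chunk of text on them, and refreshes them mid-generation. Every function below is a hypothetical stand-in.

```python
# Sketch of generation with an explicit working memory updated between chunks.
def retrieve_facts(query):
    # Stand-in for a retriever; a real system would query a knowledge store.
    return [f"fact relevant to: {query}"]

def generate_chunk(context, memory):
    # Stand-in for an LLM call conditioned on both the context and the memory.
    return f"[chunk grounded in {len(memory)} memory entries]"

def needs_refresh(chunk, memory):
    # Stand-in for a factuality check deciding whether the memory is stale.
    return len(memory) < 3

def generate_with_working_memory(prompt, num_chunks=3):
    memory = retrieve_facts(prompt)          # initialize the working memory
    output = []
    for _ in range(num_chunks):
        chunk = generate_chunk(prompt + "".join(output), memory)
        output.append(chunk)
        if needs_refresh(chunk, memory):     # update memory mid-generation
            memory += retrieve_facts(chunk)
    return "".join(output), memory

text, memory = generate_with_working_memory("Write a short biography of Ada Lovelace.")
print(text)
print("memory size at end:", len(memory))
```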
-
Arxiv paper – Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
In this episode, we discuss Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control by Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu. The paper introduces “Diffusion as Shader” (DaS), a novel approach that supports various video…
-
Arxiv paper – FaceLift: Single Image to 3D Head with View Generation and GS-LRM
In this episode, we discuss FaceLift: Single Image to 3D Head with View Generation and GS-LRM by Weijie Lyu, Yi Zhou, Ming-Hsuan Yang, Zhixin Shu. FaceLift is a feed-forward approach for rapid and high-quality 360-degree head reconstruction using a single image, utilizing a multi-view latent diffusion model followed by a GS-LRM reconstructor to create 3D…
-
Arxiv paper – GenHMR: Generative Human Mesh Recovery
In this episode, we discuss GenHMR: Generative Human Mesh Recovery by Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen. The paper introduces GenHMR, a novel generative framework for human mesh recovery (HMR) that addresses the uncertainties inherent in lifting 2D images to 3D meshes. It employs a pose tokenizer and an image-conditional…
-
Arxiv paper – Video Creation by Demonstration
In this episode, we discuss Video Creation by Demonstration by Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu. The paper introduces Video Creation by Demonstration, utilizing a method called 𝛿-Diffusion to generate videos that smoothly continue from a given context image, integrating…
-
Arxiv paper – Byte Latent Transformer: Patches Scale Better Than Tokens
In this episode, we discuss Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer. The Byte Latent Transformer (BLT) presents a novel approach to large language models…
-
Arxiv paper – Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
In this episode, we discuss Align3R: Aligned Monocular Depth Estimation for Dynamic Videos by Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu. Align3R is introduced as a method for achieving temporally consistent depth maps in videos using monocular inputs, addressing the challenge of…
-
Arxiv paper – FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
In this episode, we discuss FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion by Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu. The paper introduces FreeScale, a tuning-free inference method that enhances visual diffusion models’ ability to generate high-resolution images by combining data from…
-
Arxiv paper – ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
In this episode, we discuss ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis by Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, Yonghong Tian. ViewCrafter introduces a new method for synthesizing high-fidelity novel views from single or sparse images, using video diffusion models…