Arxiv paper – VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
In this episode, we discuss VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning by Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou. The paper introduces VideoMind, a novel video-language agent designed for precise temporal-grounded video understanding. It employs a role-based workflow with components like a planner, grounder, verifier, and answerer, integrated efficiently…
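The planner/grounder/verifier/answerer workflow can be pictured as a small pipeline of role functions. The sketch below is a toy illustration of that idea only; the function bodies, the keyword-matching grounder, and the list-of-segments "video" representation are all assumptions for demonstration, not VideoMind's actual components.

```python
# Toy sketch of a role-based video-reasoning pipeline in the spirit of a
# planner -> grounder -> verifier -> answerer workflow. All logic here is
# an illustrative stand-in, not the paper's models.

def planner(question):
    # Decide which roles to run; this toy planner always uses all three.
    return ["grounder", "verifier", "answerer"]

def grounder(video, question):
    # Temporal grounding: localize the segment relevant to the question
    # (toy version: first segment whose caption shares a question word).
    for start, end, caption in video:
        if any(word in caption for word in question.lower().split()):
            return (start, end)
    return (0, video[-1][1])  # fall back to the whole video

def verifier(video, span):
    # Accept the span only if it fully covers at least one segment.
    return any(s >= span[0] and e <= span[1] for s, e, _ in video)

def answerer(video, span):
    # Answer using only captions inside the grounded moment.
    return " / ".join(c for s, e, c in video if s >= span[0] and e <= span[1])

video = [(0, 10, "a chef chops onions"), (10, 20, "the chef fries the onions")]
question = "when are the onions chopped"
pipeline = planner(question)                  # ["grounder", "verifier", "answerer"]
span = grounder(video, question)              # (0, 10)
answer = answerer(video, span) if verifier(video, span) else "unverified"
print(span, answer)
```

The point of the role split is that grounding, checking, and answering are separate, composable steps rather than one monolithic query over the whole video.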
-
Arxiv paper – SynCity: Training-Free Generation of 3D Worlds
In this episode, we discuss SynCity: Training-Free Generation of 3D Worlds by Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi. The paper presents SynCity, a novel method for generating expansive 3D worlds directly from textual descriptions without requiring additional training or optimization. SynCity combines the geometric accuracy of pre-trained 3D generative models with…
-
Arxiv paper – HD-EPIC: A Highly-Detailed Egocentric Video Dataset
In this episode, we discuss HD-EPIC: A Highly-Detailed Egocentric Video Dataset by Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen. The paper introduces HD-EPIC, a…
-
Arxiv paper – Video-T1: Test-Time Scaling for Video Generation
In this episode, we discuss Video-T1: Test-Time Scaling for Video Generation by Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan. The paper investigates Test-Time Scaling (TTS) for video generation, aiming to enhance video quality by leveraging additional inference-time computation instead of expanding model size or training data. The authors treat video…
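One common instantiation of test-time scaling is best-of-N selection: sample several candidates and keep the one a scoring function prefers. The sketch below illustrates only that general idea; the random "generator" and identity "scorer" are toy stand-ins, not Video-T1's models.

```python
# Best-of-N sketch of test-time scaling: spend more inference compute by
# sampling more candidates and keeping the best-scoring one. Generator and
# scorer below are illustrative assumptions, not the paper's components.
import random

def generate_candidate(rng):
    # Toy stand-in for one sampled generation (a random quality value).
    return rng.random()

def score(candidate):
    # Toy stand-in for a verifier / reward model.
    return candidate

def best_of_n(n, seed=0):
    rng = random.Random(seed)
    candidates = [generate_candidate(rng) for _ in range(n)]
    return max(candidates, key=score)

# With a fixed seed, a larger candidate budget can only match or improve
# the selected score, which is the core trade-off TTS exploits.
print(best_of_n(1), best_of_n(16))
```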
-
Arxiv paper – Calibrated Multi-Preference Optimization for Aligning Diffusion Models
In this episode, we discuss Calibrated Multi-Preference Optimization for Aligning Diffusion Models by Kyungmin Lee, Xiaohang Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, Yinxiao Li. The paper introduces Calibrated Preference Optimization (CaPO), a new method for aligning text-to-image diffusion models using multiple reward models without requiring expensive…
-
Arxiv paper – Personalize Anything for Free with Diffusion Transformer
In this episode, we discuss Personalize Anything for Free with Diffusion Transformer by Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng. The paper introduces *Personalize Anything*, a training-free framework for personalized image generation using diffusion transformers (DiTs). By replacing denoising tokens with those of a reference subject, the method enables zero-shot subject reconstruction…
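The token-replacement idea in the blurb can be sketched in miniature: wherever a mask marks the subject's positions, the scene's denoising tokens are swapped for the reference subject's tokens. The 1-D string "tokens", the mask, and the single replacement step below are illustrative assumptions, not the paper's DiT internals.

```python
# Toy sketch of subject-token replacement during denoising: masked positions
# take the reference subject's tokens, the rest keep the scene's own tokens.
# All names and the 1-D token lists are illustrative assumptions.

def replace_subject_tokens(denoising_tokens, reference_tokens, mask):
    # Positions flagged by the mask adopt the reference token; others are kept.
    return [ref if m else tok
            for tok, ref, m in zip(denoising_tokens, reference_tokens, mask)]

scene = ["sky", "sky", "noise", "noise", "grass"]
reference = ["-", "-", "cat_head", "cat_body", "-"]
mask = [False, False, True, True, False]
print(replace_subject_tokens(scene, reference, mask))
# → ['sky', 'sky', 'cat_head', 'cat_body', 'grass']
```

Because the swap happens inside the denoising process rather than through fine-tuning, the approach is training-free: the subject's identity is injected directly into the latent tokens.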
-
Arxiv paper – Story-Adapter: A Training-free Iterative Framework for Long Story Visualization
In this episode, we discuss Story-Adapter: A Training-free Iterative Framework for Long Story Visualization by Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Yuyin Zhou. The paper tackles the challenge of generating coherent image sequences for long narratives using text-to-image diffusion models. It introduces Story-Adapter, a training-free and efficient framework that…
-
Arxiv paper – ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
In this episode, we discuss ReCamMaster: Camera-Controlled Generative Rendering from A Single Video by Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang. ReCamMaster is a generative framework that modifies camera trajectories in existing videos by re-rendering scenes from new perspectives. It…
-
Arxiv paper – Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
In this episode, we discuss Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models by Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, Shaohui Lin. The paper aims to enhance the reasoning abilities of Multimodal Large Language Models (MLLMs) using reinforcement learning (RL). To overcome the lack…
-
Arxiv paper – MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
In this episode, we discuss MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks by Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen. The paper introduces MEGA-BENCH, a comprehensive evaluation suite…
-
Arxiv paper – TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
In this episode, we discuss TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models by Mark YU, Wenbo Hu, Jinbo Xing, Ying Shan. TrajectoryCrafter is a new method that precisely redirects camera paths in monocular videos by separating view changes from content generation. It uses a dual-stream conditional video diffusion model that combines point…
-
Arxiv paper – PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
In this episode, we discuss PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving by Mihir Parmar, Xin Liu, Palash Goyal, Yanfei Chen, Long Le, Swaroop Mishra, Hossein Mobahi, Jindong Gu, Zifeng Wang, Hootan Nakhost, Chitta Baral, Chen-Yu Lee, Tomas Pfister, Hamid Palangi. The paper introduces **PlanGEN**, a versatile agent…
-
Arxiv paper – VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
In this episode, we discuss VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing by Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang. The paper introduces VideoGrain, a zero-shot method that enhances multi-grained video editing by modulating space-time attention mechanisms for class-, instance-, and part-level modifications. It addresses challenges like semantic misalignment and feature coupling by…
-
Arxiv paper – ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
In this episode, we discuss ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models by Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin,…
-
Arxiv paper – Teaching Language Models to Critique via Reinforcement Learning
In this episode, we discuss Teaching Language Models to Critique via Reinforcement Learning by Zhihui Xie, Jie chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong. The paper presents CTRL, a framework that uses reinforcement learning to train critic models which provide feedback for improving code generated by large language models without needing human input.…
-
Arxiv paper – PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
In this episode, we discuss PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling by Avery Ma, Yangchen Pan, Amir-massoud Farahmand. The paper introduces PANDAS, a hybrid technique that enhances many-shot jailbreaking by altering fabricated dialogues with positive affirmations, negative demonstrations, and optimized adaptive sampling tailored to specific prompts. Experimental results on…
-
Arxiv paper – VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
In this episode, we discuss VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation by Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu. The paper presents VidCRAFT3, a new framework for image-to-video generation that allows simultaneous control over camera motion, object movement, and lighting direction. It addresses previous limitations…
-
Arxiv paper – Heuristically Adaptive Diffusion-Model Evolutionary Strategy
In this episode, we discuss Heuristically Adaptive Diffusion-Model Evolutionary Strategy by Benedikt Hartl, Yanbo Zhang, Hananel Hazan, Michael Levin. The paper explores the connection between diffusion models and evolutionary algorithms, highlighting that both generate high-quality samples through iterative refinement of random initial states. By integrating deep learning-based diffusion models into evolutionary processes, the authors enhance…
-
Arxiv paper – Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
In this episode, we discuss Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach by Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein. The paper presents a new language model architecture that enhances test-time computation by iteratively reasoning in latent space using…
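Recurrent-depth scaling can be caricatured as unrolling one small update rule more times at inference: extra test-time compute refines a latent state instead of emitting more tokens. The linear contraction below is an illustrative assumption, not the paper's architecture.

```python
# Toy sketch of recurrent-depth latent reasoning: the same block is applied
# to a latent state repeatedly, so depth (and compute) is chosen at test time.
# The 0.5/0.5 contraction toward the context is an illustrative assumption.

def recurrent_refine(latent, context, steps):
    # More unrolled steps = more test-time compute on the same weights.
    for _ in range(steps):
        latent = 0.5 * latent + 0.5 * context
    return latent

shallow = recurrent_refine(0.0, 1.0, steps=2)   # 0.75
deep = recurrent_refine(0.0, 1.0, steps=10)     # ≈ 0.999
print(shallow, deep)
```

Deeper unrolling drives the latent closer to its fixed point, which is the sense in which iteration depth substitutes for parameter count or longer chains of thought.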
-
Arxiv paper – EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
In this episode, we discuss EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents by Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang. The paper presents **EMBODIEDBENCH**, a comprehensive benchmark with 1,128 tasks across…
-
Arxiv paper – VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
In this episode, we discuss VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection by Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu. The paper introduces VideoEspresso, a high-quality, large-scale VideoQA dataset that maintains essential spatial and temporal details…
-
Arxiv paper – VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
In this episode, we discuss VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models by Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin. Generative video models typically prioritize appearance accuracy over motion coherence, limiting their ability to capture realistic dynamics. The paper presents VideoJAM, a…
-
Arxiv paper – HunyuanVideo: A Systematic Framework For Large Video Generative Models
In this episode, we discuss HunyuanVideo: A Systematic Framework For Large Video Generative Models by Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan,…
-
Arxiv paper – s1: Simple test-time scaling
In this episode, we discuss s1: Simple test-time scaling by Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto. The paper introduces a straightforward method for test-time scaling in language models to enhance reasoning performance by utilizing additional computational resources during inference. The…
-
Arxiv paper – Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
In this episode, we discuss Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation by the **Hunyuan3D Team** (individual contributor names are listed at the end of the full report). Hunyuan3D 2.0 is a large-scale 3D synthesis system featuring Hunyuan3D-DiT for generating detailed…