-
arxiv preprint – Meta-Transformer: A Unified Framework for Multimodal Learning
In this episode we discuss Meta-Transformer: A Unified Framework for Multimodal Learning by Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue. The paper presents Meta-Transformer, a unified framework for processing multiple data modalities. It uses a frozen encoder for feature extraction across different modalities, including natural language,…
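For readers who want the gist in code, here is a minimal sketch of the shared frozen-encoder pattern discussed in the episode: a modality-specific tokenizer feeds a frozen, modality-agnostic transformer, followed by a lightweight task head. All module names and sizes below are illustrative stand-ins, not the released Meta-Transformer code.

```python
# Minimal sketch of the shared frozen-encoder idea (not the released Meta-Transformer code).
import torch
import torch.nn as nn

class SharedBackbonePipeline(nn.Module):
    def __init__(self, tokenizer: nn.Module, encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.tokenizer = tokenizer           # modality-specific: maps raw input to token embeddings
        self.encoder = encoder               # modality-agnostic transformer, kept frozen
        self.head = head                     # lightweight task-specific head, trained per task
        for p in self.encoder.parameters():  # freeze the shared encoder
            p.requires_grad = False

    def forward(self, x):
        tokens = self.tokenizer(x)               # (batch, seq, dim)
        feats = self.encoder(tokens)             # frozen feature extraction
        return self.head(feats.mean(dim=1))      # pool tokens, then predict

# Toy usage: a linear "tokenizer", a small transformer encoder, and a classifier head.
dim, n_classes = 64, 10
tokenizer = nn.Linear(16, dim)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(dim, n_classes)
model = SharedBackbonePipeline(tokenizer, encoder, head)
logits = model(torch.randn(2, 32, 16))  # (2, n_classes)
```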
-
ICCV 2023 – Hidden Biases of End-to-End Driving Models
In this episode we discuss Hidden Biases of End-to-End Driving Models by Bernhard Jaeger, Kashyap Chitta, Andreas Geiger. The paper discusses biases commonly found in state-of-the-art end-to-end driving systems, particularly in the context of CARLA. The first bias is a reliance on target point following for lateral recovery, while the second involves averaging multimodal…
-
arxiv preprint – Retentive Network: A Successor to Transformer for Large Language Models
In this episode we discuss Retentive Network: A Successor to Transformer for Large Language Models by Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei. The paper introduces RETNET as a successor to the Transformer architecture for language models. RETNET utilizes a retention mechanism that supports parallel, recurrent,…
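As a rough illustration of the retention mechanism mentioned above, the sketch below implements a single retention head with decay factor gamma in both its parallel and recurrent forms and checks that they agree. Normalization, gating, and the chunkwise form are omitted; this is our simplification, not the paper's full architecture.

```python
# Sketch of single-head retention (decay gamma), omitting normalization and gating details.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4
gamma = 0.9
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Parallel form: (Q K^T) masked by a causal decay matrix D, with D[n, m] = gamma**(n - m) for n >= m.
D = np.tril(gamma ** (np.arange(T)[:, None] - np.arange(T)[None, :]))
out_parallel = (Q @ K.T * D) @ V

# Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, then o_n = Q_n S_n.
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for n in range(T):
    S = gamma * S + np.outer(K[n], V[n])
    out_recurrent[n] = Q[n] @ S

assert np.allclose(out_parallel, out_recurrent)  # both forms give the same outputs
```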
-
arxiv preprint – Challenges and Applications of Large Language Models
In this episode we discuss Challenges and Applications of Large Language Models by Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, Robert McHardy. The paper provides a systematic analysis of the challenges and applications of Large Language Models (LLMs). In the Challenges section, it discusses obstacles such as dataset complexity, high training costs,…
-
ICML 2023 – Self-Repellent Random Walks on General Graphs — Achieving Minimal Sampling Variance via Nonlinear Markov Chains
In this episode we discuss Self-Repellent Random Walks on General Graphs — Achieving Minimal Sampling Variance via Nonlinear Markov Chains by Vishwaraj Doshi, Jie Hu, Do Young Eun. This paper introduces self-repellent random walks (SRRWs) as a way to improve sampling efficiency in Markov chain Monte Carlo (MCMC) procedures. It proves that the SRRWs converge…
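To give a flavor of the mechanism, the toy sketch below reweights a base random walk's transition probabilities by the empirical visit frequencies raised to a negative power, so the walk avoids nodes it has already visited often. It only illustrates the self-repellent idea on a small ring graph, not the paper's exact construction or its variance analysis.

```python
# Toy illustration of a self-repellent walk: reweight base transition probabilities by
# (empirical visit frequency / target weight) ** (-alpha). Not the paper's exact construction.
import numpy as np

rng = np.random.default_rng(1)
n, alpha, steps = 5, 2.0, 20000
P = np.zeros((n, n))                 # base Markov chain on a ring graph
for i in range(n):
    P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.5
mu = np.full(n, 1.0 / n)             # target (uniform) sampling distribution
visits = np.ones(n)                  # visit counts (start at 1 to avoid division by zero)

state = 0
for _ in range(steps):
    x = visits / visits.sum()                  # empirical visit distribution
    w = P[state] * (x / mu) ** (-alpha)        # self-repellent reweighting of the base chain
    w /= w.sum()
    state = rng.choice(n, p=w)
    visits[state] += 1

print(visits / visits.sum())  # empirical distribution should be close to uniform
```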
-
CVPR 2023 – MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
In this episode we discuss MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering by Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou. The paper introduces a model called MIST for long-form VideoQA, which addresses challenges like multi-event reasoning, interactions among visual concepts, and causality reasoning. MIST decomposes spatial-temporal…
-
arxiv preprint – Deliberate then Generate: Enhanced Prompting Framework for Text Generation
In this episode we discuss Deliberate then Generate: Enhanced Prompting Framework for Text Generation by Bei Li, Rui Wang, Junliang Guo, Kaitao Song, Xu Tan, Hany Hassan, Arul Menezes, Tong Xiao, Jiang Bian, JingBo Zhu. The paper presents a new prompting framework called Deliberate then Generate (DTG) for text generation tasks using large language models.…
-
arxiv preprint – Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts
In this episode we discuss Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts by Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao. The paper discusses Mega-TTS 2, a text-to-speech model that can synthesize speech for unseen speakers using arbitrary-length prompts.…
-
ICLR 2023 – Copy Is All You Need
In this episode we discuss Copy Is All You Need by Tian Lan, Deng Cai, Yan Wang, Heyan Huang, Xian-Ling Mao. The paper presents a novel approach to text generation by using copy-and-paste operations from an existing text collection instead of selecting from a fixed vocabulary. Contextualized representations of text segments are computed and indexed…
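The sketch below conveys the copy-and-paste generation loop in its simplest form: embed the current prefix, retrieve the indexed text segment with the highest inner-product score, append it, and repeat. The embedding functions here are toy stand-ins (random vectors), not the paper's encoders.

```python
# Hedged sketch of generation by copying indexed text segments: embed the prefix, retrieve the
# best-scoring segment, append it, repeat. Toy embeddings only.
import numpy as np

rng = np.random.default_rng(2)
segments = ["the quick brown fox", "jumps over", "the lazy dog", "in the park"]
dim = 8
segment_embs = rng.standard_normal((len(segments), dim))   # stand-in for contextualized segment vectors
segment_embs /= np.linalg.norm(segment_embs, axis=1, keepdims=True)

def embed_prefix(prefix: str) -> np.ndarray:
    # Toy prefix encoder: hash-seeded random vector (a real system would use a language model).
    v = np.random.default_rng(abs(hash(prefix)) % (2**32)).standard_normal(dim)
    return v / np.linalg.norm(v)

prefix = "a sentence about a fox:"
for _ in range(3):
    scores = segment_embs @ embed_prefix(prefix)   # inner-product search over the phrase index
    prefix += " " + segments[int(scores.argmax())]
print(prefix)
```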
-
arxiv preprint – NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
In this episode we discuss NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis by Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, Leonidas Guibas. The paper presents a method called NIFTY, which utilizes a neural interaction field to generate 3D human motions interacting with objects in a scene. The…
-
ICCV 2023 – DreamTeacher: Pretraining Image Backbones with Deep Generative Models
In this episode we discuss DreamTeacher: Pretraining Image Backbones with Deep Generative Models by Daiqing Li, Huan Ling, Amlan Kar, David Acuna, Seung Wook Kim, Karsten Kreis, Antonio Torralba, Sanja Fidler. This paper presents DreamTeacher, a self-supervised feature representation learning framework that utilizes generative networks to pre-train image backbones. The authors propose two methods of…
-
arxiv preprint – Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
In this episode we discuss Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models by Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su. This paper presents a method for generating customized images based on user specifications. The approach uses an encoder…
-
arxiv preprint – LightGlue: Local Feature Matching at Light Speed
In this episode we discuss LightGlue: Local Feature Matching at Light Speed by Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys. The paper presents LightGlue, a deep neural network that matches local features across images. LightGlue is more memory- and compute-efficient, more accurate, and easier to train than the previous state-of-the-art model. It…
-
arxiv preprint – VanillaNet: the Power of Minimalism in Deep Learning
In this episode we discuss VanillaNet: the Power of Minimalism in Deep Learning by Hanting Chen, Yunhe Wang, Jianyuan Guo, Dacheng Tao. The paper introduces VanillaNet, a neural network architecture that prioritizes simplicity and minimalism. It avoids complex operations like self-attention and uses compact and straightforward layers. Experimental results demonstrate that VanillaNet performs comparably to…
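To make the minimalism concrete, here is a small stack of plain convolution, activation, and pooling layers with no self-attention and no shortcut connections. It only illustrates the spirit of the design; the actual VanillaNet architecture and its training techniques differ.

```python
# Minimal "plain" convolutional stack: conv + activation + pooling only, no attention, no shortcuts.
# Illustrative of the minimalist spirit, not the paper's exact architecture.
import torch
import torch.nn as nn

plain_net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=4, stride=4),   # stem: aggressive downsampling
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 1000),                         # classification head
)
logits = plain_net(torch.randn(1, 3, 224, 224))   # (1, 1000)
```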
-
arxiv preprint – Secrets of RLHF in Large Language Models Part I: PPO
In this episode we discuss Secrets of RLHF in Large Language Models Part I: PPO by Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan…
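Since the episode centers on PPO, here is the standard PPO clipped surrogate loss that RLHF policy optimization builds on, written as a small function. This is a generic sketch of the objective, not the specific training recipe or tricks analyzed in the paper.

```python
# Standard PPO clipped surrogate loss (generic sketch, not the paper's full training recipe).
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the behavior policy for the sampled actions.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the pessimistic (minimum) objective, i.e. minimize its negation.
    return -torch.min(unclipped, clipped).mean()

loss = ppo_clip_loss(torch.tensor([-1.0, -0.5]), torch.tensor([-1.2, -0.4]), torch.tensor([0.8, -0.3]))
```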
-
arxiv preprint – NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
In this episode we discuss NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement by Marcos V. Conde, Javier Vazquez-Corral, Michael S. Brown, Radu Timofte. The paper introduces NILUT, a method that represents 3D lookup tables (3D LUTs) for image enhancement implicitly with a neural network. Traditional 3D LUTs are memory-intensive, so NILUT offers an alternative…
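The core idea can be sketched in a few lines: a small MLP maps input RGB coordinates to enhanced RGB, taking the place of an explicit 3D lookup grid. The layer sizes below are illustrative, and the conditioning on multiple enhancement styles from the paper's title is omitted.

```python
# Sketch of an implicit 3D LUT: an MLP maps input RGB to enhanced RGB, replacing an explicit grid.
import torch
import torch.nn as nn

implicit_lut = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),            # enhanced RGB
)

image = torch.rand(256, 256, 3)                                       # H x W x RGB in [0, 1]
enhanced = implicit_lut(image.reshape(-1, 3)).reshape(256, 256, 3)    # apply per pixel
```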
-
arxiv preprint – Large Language Models as General Pattern Machines
In this episode we discuss Large Language Models as General Pattern Machines by Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, Andy Zeng. The paper discusses the capabilities of pre-trained large language models (LLMs) in completing complex token sequences. The study shows that LLMs can effectively…
-
arxiv preprint – Lost in the Middle: How Language Models Use Long Contexts
In this episode we discuss Lost in the Middle: How Language Models Use Long Contexts by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang. This paper examines how the position of relevant information within long input contexts affects the performance of language models on tasks such as multi-document question answering and key-value retrieval.…
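A hedged sketch of the kind of probe involved: place the gold document at each position among distractor documents and record whether the model answers correctly from that ordering. The function `answer_with_llm` is a hypothetical stand-in for an LLM call, and this is our simplification of the setup rather than the paper's exact protocol.

```python
# Sketch of a position-sensitivity probe: move the gold document through the context and record
# accuracy per position. `answer_with_llm` is a hypothetical stand-in for the model call.
def accuracy_by_gold_position(question, gold_doc, distractors, answer, answer_with_llm):
    results = {}
    for pos in range(len(distractors) + 1):
        docs = distractors[:pos] + [gold_doc] + distractors[pos:]   # gold document at index `pos`
        prediction = answer_with_llm(question, docs)                # query the model with this ordering
        results[pos] = answer.lower() in prediction.lower()         # crude correctness check
    return results
```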
-
arxiv preprint – LongNet: Scaling Transformers to 1,000,000,000 Tokens
In this episode we discuss LongNet: Scaling Transformers to 1,000,000,000 Tokens by Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Furu Wei. The paper introduces LONGNET, a variant of the Transformer model that addresses the challenge of scaling sequence length in large language models. LONGNET utilizes dilated attention to exponentially expand…
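As a rough sketch of the sparsification behind dilated attention, the snippet below splits a sequence into segments of length w and keeps every r-th token within each segment, which is the pattern attention is then computed over. Mixing several (w, r) configurations and the distributed training algorithm are omitted, and the helper name is ours.

```python
# Sketch of the sparsification step in dilated attention: segment the sequence, then keep every
# r-th token within each segment before computing attention on the sparsified segments.
import numpy as np

def dilated_indices(seq_len, segment_len, dilation):
    idx = []
    for start in range(0, seq_len, segment_len):
        idx.extend(range(start, min(start + segment_len, seq_len), dilation))
    return np.array(idx)

keep = dilated_indices(seq_len=16, segment_len=8, dilation=2)
print(keep)  # [ 0  2  4  6  8 10 12 14]: tokens retained within each segment
```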
-
arxiv preprint – DisCo: Disentangled Control for Referring Human Dance Generation in Real World
In this episode we discuss DisCo: Disentangled Control for Referring Human Dance Generation in Real World by Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang. The paper introduces Referring Human Dance Generation, a new problem setting for generating realistic dance sequences. The authors emphasize three important…
-
arxiv preprint – Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
In this episode we discuss Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting by Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, Michael Bendersky. The paper introduces Pairwise Ranking Prompting (PRP) as a technique to improve document ranking using Large…
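To show the shape of the technique, here is a sketch that ranks documents with pairwise LLM comparisons: each comparison asks which of two passages is more relevant to the query, and the results drive a comparison-based sort. The judge function `llm_prefers_first` is a hypothetical stand-in for the actual prompt, and the paper studies aggregation variants beyond this simple sort.

```python
# Sketch of pairwise ranking with an LLM judge. `llm_prefers_first(query, a, b)` is a hypothetical
# stand-in that returns True if the model judges passage `a` more relevant than `b`.
from functools import cmp_to_key

def rank_documents(query, docs, llm_prefers_first):
    def compare(a, b):
        # Ask in both orders to reduce position bias; keep the original order on ties.
        first = llm_prefers_first(query, a, b)
        second = llm_prefers_first(query, b, a)
        if first and not second:
            return -1
        if second and not first:
            return 1
        return 0
    return sorted(docs, key=cmp_to_key(compare))
```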
-
arxiv preprint – LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
In this episode we discuss LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding by Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun. The paper introduces LLaVAR, an enhanced visual instruction tuning method for text-rich image understanding. The method addresses the limitation of existing pipelines in comprehending textual details…
-
arxiv preprint – Generate Anything Anywhere in Any Scene
In this episode we discuss Generate Anything Anywhere in Any Scene by Yuheng Li, Haotian Liu, Yangming Wen, Yong Jae Lee. The paper proposes a data augmentation training strategy for personalized object generation in text-to-image diffusion models. They also introduce plug-and-play adapter layers to control the location and size of the generated personalized…
-
CVPR 2023 – Consistent View Synthesis with Pose-Guided Diffusion Models
In this episode we discuss Consistent View Synthesis with Pose-Guided Diffusion Models by Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, Johannes Kopf. The paper proposes a new technique for synthesizing novel views from a single image for virtual reality applications. The proposed method, called pose-guided diffusion, generates consistent and high-quality views from…
-
arxiv preprint – BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion
In this episode we discuss BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion by Michael J. Black, Priyanka Patel, Joachim Tesch, Jinlong Yang. This paper presents BEDLAM, a large-scale synthetic dataset for 3D human pose and shape estimation. Unlike previous datasets, BEDLAM is realistic and diverse, featuring monocular RGB videos with ground-truth…