arxiv preprint – DoG is SGD’s Best Friend: A Parameter-Free Dynamic Step Size Schedule
In this episode we discuss DoG is SGD’s Best Friend: A Parameter-Free Dynamic Step Size Schedule by Maor Ivgi, Oliver Hinder, Yair Carmon. The paper presents a dynamic SGD step size formula called DoG that does not require manual tuning. The authors analyze the DoG formula and demonstrate its strong convergence guarantees for stochastic convex…
-
CVPR 2023 – LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
In this episode we discuss LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling by Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang. The paper presents LAVENDER, a unified video-language framework that uses Masked Language Modeling (MLM) as the common interface for pre-training and downstream tasks. LAVENDER simplifies the model…
-
NeurIPS 2022 – Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
In this episode we discuss Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners by Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji. VidIL is a few-shot video-language learner that combines image and language models to…
-
arxiv preprint – MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
In this episode we discuss MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action by Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang. The paper introduces MM-REACT, a system that combines ChatGPT with expert vision models to tackle challenging visual tasks. MM-REACT utilizes a…
-
arxiv preprint – 3D-LLM: Injecting the 3D World into Large Language Models
In this episode we discuss 3D-LLM: Injecting the 3D World into Large Language Models by Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan. The paper proposes a new model called 3D-LLMs that integrates the 3D physical world into language models, allowing them to perform various 3D-related tasks such as…
-
arxiv preprint – Meta-Transformer: A Unified Framework for Multimodal Learning
In this episode we discuss Meta-Transformer: A Unified Framework for Multimodal Learning by Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue. The paper presents a framework called Meta-Transformer for processing multiple modalities in multimodal learning. It uses a frozen encoder for feature extraction across different modalities, including natural language,…
-
ICCV 2023 – Hidden Biases of End-to-End Driving Models
In this episode we discuss Hidden Biases of End-to-End Driving Models by Bernhard Jaeger, Kashyap Chitta, Andreas Geiger. The paper discusses biases commonly found in state-of-the-art end-to-end driving systems, particularly in the context of CARLA. The first bias is a preference for target point following for lateral recovery, while the second bias involves averaging multimodal…
-
arxiv preprint – Retentive Network: A Successor to Transformer for Large Language Models
In this episode we discuss Retentive Network: A Successor to Transformer for Large Language Models by Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, Furu Wei. The paper introduces RETNET as a successor to the Transformer architecture for language models. RETNET utilizes a retention mechanism that supports parallel, recurrent,…
-
arxiv preprint – Challenges and Applications of Large Language Models
In this episode we discuss Challenges and Applications of Large Language Models by Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, Robert McHardy. The paper provides a systematic analysis of the challenges and applications of Large Language Models (LLMs). In the Challenges section, it discusses obstacles such as dataset complexity, high training costs,…
-
ICML 2023 – Self-Repellent Random Walks on General Graphs — Achieving Minimal Sampling Variance via Nonlinear Markov Chains
In this episode we discuss Self-Repellent Random Walks on General Graphs — Achieving Minimal Sampling Variance via Nonlinear Markov Chains by Vishwaraj Doshi, Jie Hu, Do Young Eun. This paper introduces self-repellent random walks (SRRWs) as a way to improve sampling efficiency in Markov chain Monte Carlo (MCMC) procedures. It proves that the SRRWs converge…
-
CVPR 2023 – MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
In this episode we discuss MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering by Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou. The paper introduces a model called MIST for long-form VideoQA, which addresses challenges like multi-event reasoning, interactions among visual concepts, and causality reasoning. MIST decomposes spatial-temporal…
-
arxiv preprint – Deliberate then Generate: Enhanced Prompting Framework for Text Generation
In this episode we discuss Deliberate then Generate: Enhanced Prompting Framework for Text Generation by Bei Li, Rui Wang, Junliang Guo, Kaitao Song, Xu Tan, Hany Hassan, Arul Menezes, Tong Xiao, Jiang Bian, JingBo Zhu. The paper presents a new prompting framework called Deliberate then Generate (DTG) for text generation tasks using large language models.…
-
arxiv preprint – Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts
In this episode we discuss Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts by Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao. The paper discusses Mega-TTS 2, a text-to-speech model that can synthesize speech for unseen speakers using arbitrary-length prompts.…
-
ICLR 2023 – Copy Is All You Need
In this episode we discuss Copy Is All You Need by Tian Lan, Deng Cai, Yan Wang, Heyan Huang, Xian-Ling Mao. The paper presents a novel approach to text generation by using copy-and-paste operations from an existing text collection instead of selecting from a fixed vocabulary. Contextualized representations of text segments are computed and indexed…
-
arxiv preprint – NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
In this episode we discuss NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis by Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, Leonidas Guibas. The paper presents a method called NIFTY, which utilizes a neural interaction field to generate 3D human motions interacting with objects in a scene. The…
-
ICCV 2023 – DreamTeacher: Pretraining Image Backbones with Deep Generative Models
In this episode we discuss DreamTeacher: Pretraining Image Backbones with Deep Generative Models by Daiqing Li, Huan Ling, Amlan Kar, David Acuna, Seung Wook Kim, Karsten Kreis, Antonio Torralba, Sanja Fidler. This paper presents DreamTeacher, a self-supervised feature representation learning framework that utilizes generative networks to pre-train image backbones. The authors propose two methods of…
-
arxiv preprint – Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
In this episode we discuss Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models by Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su. This paper presents a method for generating customized images based on user specifications. The approach uses an encoder…
-
arxiv preprint – LightGlue: Local Feature Matching at Light Speed
In this episode we discuss LightGlue: Local Feature Matching at Light Speed by Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys. The paper presents LightGlue, a deep neural network that matches local features across images. LightGlue is more efficient in terms of memory and computation, more accurate, and easier to train compared to the state-of-the-art model. It…
-
arxiv preprint – VanillaNet: the Power of Minimalism in Deep Learning
In this episode we discuss VanillaNet: the Power of Minimalism in Deep Learning by Hanting Chen, Yunhe Wang, Jianyuan Guo, Dacheng Tao. The paper introduces VanillaNet, a neural network architecture that prioritizes simplicity and minimalism. It avoids complex operations like self-attention and uses compact and straightforward layers. Experimental results demonstrate that VanillaNet performs comparably to…
-
arxiv preprint – Secrets of RLHF in Large Language Models Part I: PPO
In this episode we discuss Secrets of RLHF in Large Language Models Part I: PPO by Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan…
-
arxiv preprint – NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement
In this episode we discuss NILUT: Conditional Neural Implicit 3D Lookup Tables for Image Enhancement by Marcos V. Conde, Javier Vazquez-Corral, Michael S. Brown, Radu Timofte. The paper introduces NILUT, a method that uses neural networks to enhance images using 3D lookup tables (3D LUTs). Traditional 3D LUTs are memory-intensive, so NILUT offers an alternative…
-
arxiv preprint – Large Language Models as General Pattern Machines
In this episode we discuss Large Language Models as General Pattern Machines by Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, Andy Zeng. The paper discusses the capabilities of pre-trained large language models (LLMs) in completing complex token sequences. The study shows that LLMs can effectively…
-
arxiv preprint – Lost in the Middle: How Language Models Use Long Contexts
In this episode we discuss Lost in the Middle: How Language Models Use Long Contexts by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang. This paper examines the impact of context length on the performance of language models in tasks such as multi-document question answering and key-value retrieval.…
-
arxiv preprint – LongNet: Scaling Transformers to 1,000,000,000 Tokens
In this episode we discuss LongNet: Scaling Transformers to 1,000,000,000 Tokens by Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Furu Wei. The paper introduces LONGNET, a variant of the Transformer model that addresses the challenge of scaling sequence length in large language models. LONGNET utilizes dilated attention to exponentially expand…
-
arxiv preprint – DisCo: Disentangled Control for Referring Human Dance Generation in Real World
In this episode we discuss DisCo: Disentangled Control for Referring Human Dance Generation in Real World by Tan Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang. The paper introduces a new problem setting in generating realistic dance sequences called Referring Human Dance Generation. The authors emphasize three important…