arxiv preprint – Weight subcloning: direct initialization of transformers using larger pretrained ones
In this episode we discuss Weight subcloning: direct initialization of transformers using larger pretrained ones by Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari. The paper introduces a new method called weight subcloning to expedite the training of small transformer models by initializing them with weights from…
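For a rough intuition of the idea, here is a minimal sketch of initializing a smaller layer from a larger pretrained one by slicing its weight matrix. The dimensions and the block-selection rule are placeholders, not the paper's actual neuron-ranking procedure.

```python
import numpy as np

# Illustrative sketch only: a pretrained layer with hidden size 1024 and a
# target "subcloned" layer with hidden size 512 (both hypothetical).
d_large, d_small = 1024, 512
W_large = np.random.randn(d_large, d_large).astype(np.float32)  # stand-in for pretrained weights

# Initialize the smaller layer from a sub-block of the larger matrix instead of
# random noise. (The paper ranks and selects neurons; taking the leading block
# is just a placeholder for that selection step.)
W_small_init = W_large[:d_small, :d_small].copy()
```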
-
arxiv preprint – Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task
In this episode we discuss Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task by Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka. The paper investigates how conditional diffusion models generalize compositionally by studying their ability to generate novel data combinations within a controlled synthetic environment. Key discoveries include that compositional…
-
arxiv preprint – LLM in a flash: Efficient Large Language Model Inference with Limited Memory
In this episode, we discuss LLM in a flash: Efficient Large Language Model Inference with Limited Memory by Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar. The paper introduces an approach to operate large language models (LLMs) efficiently on devices with limited DRAM by using…
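As a toy illustration of the general direction (not the paper's windowing or row-column bundling techniques), one can keep large weight tensors on disk and page in only the rows that are needed, for example via a memory map. The file name, shape, and "active rows" below are made up.

```python
import numpy as np

# Keep a large FFN weight matrix on disk and memory-map it so DRAM only holds
# the rows we actually touch. File name and shape are hypothetical.
W = np.memmap("ffn_weights.bin", dtype=np.float32, mode="w+", shape=(50_000, 4096))

active_rows = [3, 17, 42]              # e.g. neurons predicted to be active for this token
W_active = np.asarray(W[active_rows])  # only these rows are read from flash into DRAM
```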
-
arxiv preprint – The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
In this episode, we discuss The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction by Pratyusha Sharma, Jordan T. Ash, Dipendra Misra. The paper presents Layer-Selective Rank Reduction (LASER), an innovative method that enhances Transformer-based Large Language Models (LLMs) by reducing higher-order features in their weight matrices post-training, without adding…
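The core operation LASER applies, replacing a weight matrix with a low-rank approximation via truncated SVD, is easy to sketch. Which layer and rank to target is the paper's contribution and is not reproduced here; the matrix below is a random stand-in.

```python
import numpy as np

def low_rank_approx(W: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation of W in the least-squares sense (truncated SVD)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank, :]

# Toy example on a random stand-in for an MLP weight matrix.
W = np.random.randn(768, 3072).astype(np.float32)
W_reduced = low_rank_approx(W, rank=64)
```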
-
arxiv preprint – DreaMoving: A Human Video Generation Framework based on Diffusion Models
In this episode we discuss DreaMoving: A Human Video Generation Framework based on Diffusion Models by Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei, Miaomiao Cui, Peiran Ren, Xuansong Xie. DreaMoving is a framework that uses diffusion…
-
arxiv preprint – Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
In this episode we discuss Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution by Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby. The paper introduces NaViT (Native Resolution…
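A minimal sketch of the packing idea: images of different resolutions yield different numbers of patches, which are concatenated into one sequence along with per-image IDs that would drive a block-diagonal attention mask. This illustrates sequence packing in general, not NaViT's full pipeline (no token dropping, no factorized position embeddings).

```python
import torch

def patchify(img: torch.Tensor, p: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping p x p patches -> (num_patches, C*p*p)."""
    C, H, W = img.shape
    img = img[:, : H // p * p, : W // p * p]          # drop any remainder
    patches = img.unfold(1, p, p).unfold(2, p, p)     # (C, H/p, W/p, p, p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, C * p * p)

# Two images with different aspect ratios and resolutions, packed into one sequence.
imgs = [torch.randn(3, 224, 224), torch.randn(3, 160, 320)]
patch_seqs = [patchify(im) for im in imgs]
packed = torch.cat(patch_seqs)                        # (total_patches, patch_dim)
image_ids = torch.cat([torch.full((len(s),), i) for i, s in enumerate(patch_seqs)])
# image_ids would mask attention so patches only attend within their own image.
```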
-
arxiv preprint – UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
In this episode, we discuss UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces by Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo. The paper introduces UniRef++, a unified architecture designed to address four reference-based object segmentation tasks: referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation…
-
arxiv preprint – LongNet: Scaling Transformers to 1,000,000,000 Tokens
In this episode we discuss LongNet: Scaling Transformers to 1,000,000,000 Tokens by Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei. LONGNET is a new Transformer variant that allows for efficient processing of sequences over 1 billion tokens long using a novel dilated attention mechanism. This mechanism provides…
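A heavily simplified, single-configuration sketch of the dilated-attention idea: attention is restricted to every `dilation`-th position within fixed-length segments, so cost grows roughly linearly in sequence length. The real mechanism mixes multiple segment-length/dilation pairs across heads and is causal; none of that is shown here.

```python
import torch

def dilated_attention(q, k, v, segment_len: int = 8, dilation: int = 2):
    """Toy single-head sketch: attend only among every `dilation`-th token of each segment."""
    B, T, D = q.shape
    out = torch.zeros_like(q)
    for start in range(0, T, segment_len):
        idx = torch.arange(start, min(start + segment_len, T), dilation)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        attn = torch.softmax(qs @ ks.transpose(-1, -2) / D ** 0.5, dim=-1)
        out[:, idx] = attn @ vs
    return out

q = k = v = torch.randn(1, 64, 32)
y = dilated_attention(q, k, v)
```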
-
arxiv preprint – MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
In this episode, we discuss MotionCtrl: A Unified and Flexible Motion Controller for Video Generation by Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan. The study introduces MotionCtrl, a novel approach for video generation that can separately regulate camera and object motions, addressing limitations in previous methodologies that lacked…
-
arxiv preprint – Model-tuning Via Prompts Makes NLP Models Adversarially Robust
In this episode we discuss Model-tuning Via Prompts Makes NLP Models Adversarially Robust by Mrigank Raman, Pratyush Maini, J. Zico Kolter, Zachary C. Lipton, Danish Pruthi. The discussed paper presents a new method called Model-tuning Via Prompts (MVP) that significantly improves the adversarial robustness of pretrained language models over the standard multilayer perceptron fine-tuning (MLP-FT)…
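The general flavor of prompt-based classification that MVP builds on can be sketched as follows: instead of attaching a new MLP head, the pretrained masked-language-model head scores a small set of verbalizer tokens in a cloze template. The template, verbalizer, and model choice below are illustrative placeholders, not the paper's setup or training procedure.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

text = "The movie was surprisingly good."
prompt = f"{text} It was {tokenizer.mask_token}."             # hypothetical cloze template
verbalizer = {"positive": " great", "negative": " terrible"}  # hypothetical label words

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

scores = {label: logits[tokenizer.encode(word, add_special_tokens=False)[0]].item()
          for label, word in verbalizer.items()}
print(max(scores, key=scores.get))  # predicted label via the pretrained LM head
```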
-
arxiv preprint – Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
In this episode we discuss Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation by Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler. The paper presents Marigold, a new method for monocular depth estimation that utilizes the learned priors from generative diffusion models, specifically derived from Stable Diffusion. Marigold is affine-invariant…
-
arxiv preprint – Instruction-tuning Aligns LLMs to the Human Brain
In this episode we discuss Instruction-tuning Aligns LLMs to the Human Brain by Khai Loong Aw, Syrielle Montariol, Badr AlKhamissi, Martin Schrimpf, Antoine Bosselut. The paper examines whether instruction-tuning, a method for fine-tuning large language models (LLMs), makes their processing more human-like through two metrics: brain alignment and behavioral alignment. Results indicate instruction-tuning increases brain…
-
arxiv preprint – WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia
In this episode we discuss WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia by Sina J. Semnani, Violet Z. Yao, Heidi C. Zhang, Monica S. Lam. The paper introduces WikiChat, a chatbot that uses a few-shot Large Language Model (LLM) grounded in Wikipedia to provide accurate, engaging responses with…
-
arxiv preprint – DemoFusion: Democratising High-Resolution Image Generation With No $$$
In this episode we discuss DemoFusion: Democratising High-Resolution Image Generation With No $$$ by Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma. The paper introduces DemoFusion, a framework designed to enhance open-source Latent Diffusion Models (LDMs) for higher-resolution image generation. It incorporates Progressive Upscaling, Skip Residual, and Dilated Sampling to improve image quality…
-
arxiv preprint – Recommender Systems with Generative Retrieval
In this episode we discuss Recommender Systems with Generative Retrieval by Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, Maheswaran Sathiamoorthy. The paper presents a novel generative approach for large-scale retrieval in recommender systems, where a…
-
arxiv preprint – Mamba: Linear-Time Sequence Modeling with Selective State Spaces
In this episode we discuss Mamba: Linear-Time Sequence Modeling with Selective State Spaces by Albert Gu, Tri Dao. The paper presents Mamba, an innovative neural network architecture that outperforms traditional Transformer models, especially in handling very long sequences. Mamba’s design incorporates selective structured state space models (SSMs) whose parameters depend on input tokens, enabling content-based…
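A drastically simplified, one-state-per-channel sketch of the "selective" recurrence, in which the projections applied to the state depend on the current input token. Discretization, the larger state dimension, and the hardware-aware parallel scan that make Mamba fast are all omitted; parameter names here are placeholders.

```python
import numpy as np

def selective_scan(x: np.ndarray, W_B: np.ndarray, W_C: np.ndarray, decay: float = 0.9):
    """Toy selective state-space recurrence over a (T, d) sequence."""
    T, d = x.shape
    h = np.zeros(d, dtype=x.dtype)
    outputs = []
    for t in range(T):
        B_t = x[t] @ W_B              # input-dependent input projection ("selection")
        C_t = x[t] @ W_C              # input-dependent output projection
        h = decay * h + B_t * x[t]    # per-channel state update (decay stands in for the discretized A)
        outputs.append(C_t * h)
    return np.stack(outputs)

d = 16
x = np.random.randn(32, d).astype(np.float32)
y = selective_scan(x, np.random.randn(d, d).astype(np.float32),
                   np.random.randn(d, d).astype(np.float32))
```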
-
arxiv preprint – Block-State Transformers
In this episode we discuss Block-State Transformers by Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin. The paper introduces the Block-State Transformer (BST) architecture that merges state space models and block-wise attention to effectively capture long-range dependencies and improve performance on language modeling tasks. The BST incorporates an SSM sublayer for…
-
arxiv preprint – Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns
In this episode we discuss Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns by Brian DuSell, David Chiang. The paper introduces stack attention, a novel attention mechanism that incorporates the concept of stacks to help recognize hierarchical and nested syntactic structures, which traditional scaled dot-product attention fails to handle effectively. Two versions…
-
arxiv preprint – LooseControl: Lifting ControlNet for Generalized Depth Conditioning
In this episode we discuss LooseControl: Lifting ControlNet for Generalized Depth Conditioning by Shariq Farooq Bhat, Niloy J. Mitra, Peter Wonka. LOOSECONTROL is introduced as a novel method for depth-conditioned image generation that is less reliant on detailed depth maps, unlike the state-of-the-art ControlNet. It allows for content creation by specifying scene boundaries or 3D…
-
Announcement: AI Breakdown Youtube Channel
Welcome back to AI Breakdown! In this special announcement, your hosts Megan and Ray share exciting news – we’re expanding to YouTube! This new platform will add a visual dimension to our discussions, bringing AI papers to life with figures, tables, and results. While the podcast will continue as usual, the YouTube channel will offer…
-
arxiv preprint – OneLLM: One Framework to Align All Modalities with Language
In this episode we discuss OneLLM: One Framework to Align All Modalities with Language by Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue. The paper introduces OneLLM, a multimodal large language model that unifies the encoding of eight different modalities to language via a single…
-
arxiv preprint – The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
In this episode we discuss The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning by Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi. The paper discusses the effectiveness of traditional alignment tuning methods for large language models (LLMs) and introduces a new, simple tuning-free…
-
arxiv – MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
In this episode, we discuss MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI by Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao…
-
arxiv preprint – MLP-Mixer: An all-MLP Architecture for Vision
In this episode we discuss MLP-Mixer: An all-MLP Architecture for Vision by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy. The paper presents MLP-Mixer, an architecture that relies solely on multi-layer perceptrons (MLPs) for image classification tasks, demonstrating that…
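The block structure described, alternating a token-mixing MLP across patches with a channel-mixing MLP across features, is compact enough to sketch directly; the hidden sizes below are arbitrary placeholders rather than the paper's configurations.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One Mixer block: token-mixing MLP over the patch axis, then channel-mixing MLP over features."""
    def __init__(self, num_patches: int, dim: int, token_hidden: int = 256, channel_hidden: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, patches, dim)
        y = self.norm1(x).transpose(1, 2)                  # (batch, dim, patches): mix across patches
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

block = MixerBlock(num_patches=196, dim=512)
out = block(torch.randn(2, 196, 512))
```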
-
arxiv preprint – Training Chain-of-Thought via Latent-Variable Inference
In this episode we discuss Training Chain-of-Thought via Latent-Variable Inference by Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous. The paper introduces a fine-tuning strategy for large language models that improves their problem-solving accuracy by focusing on maximizing the probability…