  • arXiv preprint – MLP-Mixer: An all-MLP Architecture for Vision

    In this episode we discuss MLP-Mixer: An all-MLP Architecture for Vision by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy. The paper presents MLP-Mixer, an architecture that relies solely on multi-layer perceptrons (MLPs) for image classification tasks, demonstrating that…
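
    For readers who want the shape of the idea in code, here is a minimal sketch of one Mixer layer in PyTorch (the framework choice is ours; this is not the authors' code): a token-mixing MLP applied across patches, then a channel-mixing MLP applied per patch.

      import torch.nn as nn

      class MixerBlock(nn.Module):
          # One Mixer layer: mix across the patch axis, then across channels.
          def __init__(self, num_patches, dim, token_hidden, channel_hidden):
              super().__init__()
              self.norm1 = nn.LayerNorm(dim)
              self.token_mlp = nn.Sequential(
                  nn.Linear(num_patches, token_hidden), nn.GELU(),
                  nn.Linear(token_hidden, num_patches))
              self.norm2 = nn.LayerNorm(dim)
              self.channel_mlp = nn.Sequential(
                  nn.Linear(dim, channel_hidden), nn.GELU(),
                  nn.Linear(channel_hidden, dim))

          def forward(self, x):                      # x: (batch, patches, channels)
              y = self.norm1(x).transpose(1, 2)      # (batch, channels, patches)
              x = x + self.token_mlp(y).transpose(1, 2)
              return x + self.channel_mlp(self.norm2(x))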

  • arXiv preprint – Nash Learning from Human Feedback

    In this episode we discuss Nash Learning from Human Feedback by Remi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot from Google DeepMind. The paper introduces Nash Learning…

  • arXiv preprint – Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

    In this episode we discuss Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation by Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo. The paper presents a novel framework designed for character animation that synthesizes consistent and controllable videos from still images using diffusion models. It introduces a ReferenceNet that…

  • arXiv preprint – Knowledge is a Region in Weight Space for Fine-tuned Language Models

    In this episode we discuss Knowledge is a Region in Weight Space for Fine-tuned Language Models by Almog Gueta, Elad Venezian, Colin Raffel, Noam Slonim, Yoav Katz, Leshem Choshen. The paper investigates the relationships between different neural network models when trained on diverse datasets, focusing on their weight space and loss landscape. The study reveals…
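
    The object of study is easy to picture in code: points between two fine-tuned checkpoints. A minimal sketch (a hypothetical helper, not the paper's code) of interpolating two state dicts:

      def interpolate(sd_a, sd_b, alpha):
          # Convex combination of two fine-tuned checkpoints; the paper finds
          # that such in-between points often lie in a low-loss region and can
          # perform comparably to, or better than, the endpoints.
          return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}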

  • arXiv preprint – MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

    In this episode we discuss MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel. The paper introduces MobileCLIP, a family of efficient image-text models optimized for mobile devices, together with a novel multi-modal reinforced training method that enhances accuracy without increasing on-device computational demands…

  • arXiv preprint – Simplifying Transformer Blocks

    In this episode we discuss Simplifying Transformer Blocks by Bobby He, Thomas Hofmann. The paper studies the possibility of simplifying standard transformer blocks without reducing training speed by experimenting with the removal of certain components such as skip connections and normalization layers. Using signal propagation theory along with empirical research, the authors justify modifications that…
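
    As a rough illustration only (not the paper's exact parameterization, which relies on signal-propagation-informed initialization and modified attention), a block with skip connections and normalization removed looks like:

      import torch.nn as nn

      class SimplifiedBlock(nn.Module):
          # Illustrative: attention and MLP with the residual branches and
          # LayerNorms stripped out. The paper makes such blocks trainable via
          # careful initialization and reshaped attention, omitted here.
          def __init__(self, dim, heads):
              super().__init__()
              self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
              self.mlp = nn.Sequential(
                  nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

          def forward(self, x):
              x, _ = self.attn(x, x, x)  # no skip connection, no norm
              return self.mlp(x)         # no skip connection, no norm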

  • arXiv preprint – Visual In-Context Prompting

    In this episode we discuss Visual In-Context Prompting by Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao. This paper introduces universal visual in-context prompting, a new framework for improving zero-shot learning capabilities in vision tasks, which works by…

  • arXiv preprint – GAIA: a benchmark for General AI Assistants

    In this episode we discuss GAIA: a benchmark for General AI Assistants by Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom. The paper introduces GAIA, a benchmark designed to assess the capabilities of General AI Assistants in performing tasks that are simple for humans yet difficult for AIs, such as reasoning,…

  • arXiv preprint – DisCo: Disentangled Control for Realistic Human Dance Generation

    In this episode we discuss DisCo: Disentangled Control for Realistic Human Dance Generation by Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang. The paper discusses the challenges of generative AI in creating realistic human-centric dance content for social media, highlighting the need for models to…

  • arXiv preprint – Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

    In this episode we discuss Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation by Eric Zelikman, Eliana Lorch, Lester Mackey, Adam Tauman Kalai. The paper discusses how a language-model-infused scaffolding program uses a seed “improver” program to iteratively improve itself by querying a language model multiple times and optimizing based on a utility function. The improved…
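
    The scaffolding fits in a few lines. A hedged sketch (the `lm` callable and all names are our assumptions): query the language model for candidate rewrites of a program, keep the best under a utility function, and, for the paper's recursive twist, point the improver at its own source code.

      def improve(program: str, utility, lm, rounds: int = 3) -> str:
          # Seed "improver": propose rewrites with the LM, keep the best scorer.
          best = program
          for _ in range(rounds):
              candidates = [lm(f"Improve the following program:\n{best}")
                            for _ in range(4)]
              best = max(candidates + [best], key=utility)
          return best
      # STOP's recursive step: run the improver on the improver's own source.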

  • arXiv preprint – A General Theoretical Paradigm to Understand Learning from Human Preferences

    In this episode we discuss A General Theoretical Paradigm to Understand Learning from Human Preferences by Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos. The paper explores reinforcement learning from human feedback (RLHF) and proposes ΨPO, a new theoretical framework that directly utilizes pairwise preferences without relying on…
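
    A sketch of the objective as commonly stated (Ψ is a non-decreasing map of the preference probability, τ a regularization strength, μ a comparison policy):

      \max_{\pi}\; \mathbb{E}_{x\sim\rho,\, y\sim\pi(\cdot\mid x),\, y'\sim\mu(\cdot\mid x)}
        \Big[ \Psi\big( p^{*}(y \succ y' \mid x) \big) \Big]
        \;-\; \tau\, \mathrm{KL}\big( \pi \,\Vert\, \pi_{\mathrm{ref}} \big)

    Taking Ψ as the identity gives the IPO special case analyzed in the paper, while Ψ(q) = log(q / (1 − q)) recovers the standard RLHF/DPO objective.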

  • arXiv preprint – ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    In this episode we discuss ShareGPT4V: Improving Large Multi-Modal Models with Better Captions by Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin. The ShareGPT4V dataset, with 1.2 million rich descriptive captions, has been created to enhance modality alignment in large multi-modal models (LMMs), offering greater diversity and…

  • arXiv preprint – S-LoRA: Serving Thousands of Concurrent LoRA Adapters

    In this episode we discuss S-LoRA: Serving Thousands of Concurrent LoRA Adapters by Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica. The paper introduces S-LoRA, a system for efficiently serving a large number of Low-Rank Adaptation (LoRA) language…
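
    The compute pattern being served is simple even if the systems work (memory paging, custom kernels) is not. A hedged sketch of the per-layer math, with illustrative names rather than S-LoRA's API:

      import torch

      def lora_batched(x, W, adapters, ids):
          # Shared base projection for the whole batch, plus a per-request
          # low-rank delta x @ A_i @ B_i selected by each request's adapter id.
          out = x @ W                          # (batch, d_out)
          for i, rid in enumerate(ids):
              A, B = adapters[rid]             # (d_in, r), (r, d_out)
              out[i] = out[i] + x[i] @ A @ B
          return out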

  • arXiv preprint – Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

    In this episode we discuss Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities by AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova. The paper presents Mirasol3B, a multimodal model that handles the disparate natures of video, audio, and text modalities through separate autoregressive components, dividing the process according…

  • arXiv preprint – LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

    In this episode we discuss LCM-LoRA: A Universal Stable-Diffusion Acceleration Module by Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, Hang Zhao. The paper discusses the advancements in Latent Consistency Models (LCMs), which have shown great efficiency in text-to-image generation by being distilled from…
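
    In practice the module is applied like any other LoRA on top of an existing Stable Diffusion pipeline. A sketch using Hugging Face diffusers, assuming a recent version that ships LCMScheduler and the published latent-consistency adapter weights:

      import torch
      from diffusers import DiffusionPipeline, LCMScheduler

      pipe = DiffusionPipeline.from_pretrained(
          "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
      ).to("cuda")
      pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
      pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")  # acceleration module
      # Few-step sampling enabled by the LCM-LoRA weights.
      image = pipe("a photo of a red fox",
                   num_inference_steps=4, guidance_scale=1.0).images[0]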

  • arXiv preprint – Fine-tuning Language Models for Factuality

    In this episode we discuss Fine-tuning Language Models for Factuality by Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn. The paper presents a method to improve the factual accuracy of large pre-trained language models (LLMs) without human fact-checking. By utilizing recent advancements in natural language processing (NLP), such as judging the factuality…

  • arXiv preprint – Language Models can be Logical Solvers

    In this episode we discuss Language Models can be Logical Solvers by Jiazhan Feng, Ruochen Xu, Junheng Hao, Hiteshi Sharma, Yelong Shen, Dongyan Zhao, Weizhu Chen. The paper presents LOGIPT, a new language model designed to tackle complex logical reasoning by directly mimicking the reasoning process of logical solvers, which avoids errors caused by parsing…

  • arXiv preprint – Prompt Engineering a Prompt Engineer

    In this episode we discuss Prompt Engineering a Prompt Engineer by Qinyuan Ye, Maxamed Axmed, Reid Pryzant, Fereshte Khani. The paper presents PE2, an advanced method for automatically engineering prompts for large language models (LLMs), enabling them to perform better at complex tasks. By incorporating elements like a step-by-step reasoning template and verbalized optimization concepts…
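
    The loop being automated can be sketched as a meta-prompt that asks the LLM to reason over failures and then edit the task prompt (the template text below is illustrative, not PE2's actual meta-prompt):

      META_PROMPT = """You are optimizing a task prompt.
      Current prompt: {prompt}
      Failing examples: {failures}
      Step 1: For each failure, explain what the current prompt got wrong.
      Step 2: Propose a targeted edit. Return only the revised prompt."""

      def refine(prompt, failures, lm):
          # One optimization step: the LM inspects failures, then rewrites the prompt.
          return lm(META_PROMPT.format(prompt=prompt, failures=failures))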

  • arXiv preprint – CogVLM: Visual Expert for Pretrained Language Models

    In this episode we discuss CogVLM: Visual Expert for Pretrained Language Models by Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang. CogVLM is an open-source visual language foundation model that significantly…

  • arXiv preprint – De-Diffusion Makes Text a Strong Cross-Modal Interface

    In this episode we discuss De-Diffusion Makes Text a Strong Cross-Modal Interface by Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu. The paper introduces De-Diffusion, a new approach that uses text to represent images. An autoencoder is used to transform an image into text, which can be reconstructed back into the…
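
    The training setup is an autoencoder with text as the bottleneck. A conceptual sketch (the callables are placeholders, not the paper's code):

      def de_diffusion_loss(image, encoder, frozen_text_to_image, loss_fn):
          # The encoder maps an image to a text description; a frozen
          # text-to-image diffusion model acts as the decoder. Minimizing
          # reconstruction error forces the text to carry the image content.
          text = encoder(image)
          reconstruction = frozen_text_to_image(text)
          return loss_fn(reconstruction, image)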

  • arXiv preprint – E3 TTS: Easy End-to-End Diffusion-based Text to Speech

    In this episode we discuss E3 TTS: Easy End-to-End Diffusion-based Text to Speech by Yuan Gao, Nobuyuki Morioka, Yu Zhang, Nanxin Chen. The paper introduces Easy End-to-End Diffusion-based Text to Speech (E3 TTS), an innovative text-to-speech model that converts text to audio using a diffusion process without the need for intermediate representations or alignment information…

  • arXiv preprint – Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges

    In this episode we discuss Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges by Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, Huaxiu Yao. The study introduces the Bingo benchmark to analyze hallucination behavior in GPT-4V(ision), a model processing both visual and textual data. Hallucinations, categorized as either bias…

  • arXiv preprint – Learning From Mistakes Makes LLM Better Reasoner

    In this episode we discuss Learning From Mistakes Makes LLM Better Reasoner by Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, Weizhu Chen. The paper introduces LEarning from MistAkes (LEMA), a method that improves large language models’ (LLMs) ability to solve math problems by fine-tuning them using GPT-4-generated mistake-correction data pairs. LEMA involves…
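
    The fine-tuning data has a simple shape: a problem, an incorrect solution, and a correction that names the error and redoes the reasoning. An illustrative example (field names and wording are our assumptions, not the paper's schema):

      example = {
          "question": "Natalia sold 48 clips in April and half as many in May. "
                      "How many clips did she sell in total?",
          "incorrect_solution": "48 + 48 = 96, so 96 clips.",
          "correction": "Error: May's sales are half of April's, i.e. 24, not 48. "
                        "Correct: 48 + 24 = 72 clips.",
      }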

  • arXiv preprint – The Generative AI Paradox: “What It Can Create, It May Not Understand”

    In this episode we discuss The Generative AI Paradox: “What It Can Create, It May Not Understand” by Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, Yejin Choi. The paper examines the paradox in generative…

  • arXiv preprint – TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise

    In this episode we discuss TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise by Nan He, Hanyu Lai, Chenyang Zhao, Zirui Cheng, Junting Pan, Ruoyu Qin, Ruofan Lu, Rui Lu, Yunchen Zhang, Gangming Zhao, Zhaohui Hou, Zhiyuan Huang, Shaoqing Lu, Ding Liang, Mingjie Zhan. The paper introduces TeacherLM, a series of…