Preprint: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

In this episode, we discuss Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale by Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu from Meta AI. The paper presents a breakthrough in generative modeling for speech, addressing the limited scalability and task generalization of current speech generative models. The authors introduce Voicebox, a non-autoregressive flow-matching model trained on over 50K hours of speech that can perform monolingual or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. Like large-scale generative models for language and vision, Voicebox can solve tasks it was not explicitly trained on through in-context learning.
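To make the "flow-matching" training idea concrete, here is a minimal sketch of the conditional flow-matching objective that models of this family regress: a point is sampled on a straight path between Gaussian noise and a data example, and the target is the constant velocity along that path. All names here (`cfm_pair`, the 80-dim "frame") are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_min = 1e-4  # small floor so the path does not collapse exactly to noise

def cfm_pair(x1):
    """Sample one conditional flow-matching training pair (hypothetical helper).

    x1 : a data example (e.g. a mel-spectrogram frame).
    Returns (t, xt, target): the model is trained so that
    v_model(xt, t) ≈ target, a simple regression loss.
    """
    t = rng.uniform()                                 # random time in [0, 1]
    x0 = rng.standard_normal(x1.shape)                # noise endpoint
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1      # point on the path
    target = x1 - (1 - sigma_min) * x0                # velocity to regress
    return t, xt, target

# usage: one synthetic 80-dim frame standing in for speech features
x1 = rng.standard_normal(80)
t, xt, target = cfm_pair(x1)
```

At inference time, samples are generated by integrating the learned velocity field from noise to data with an ODE solver, which is what makes the approach non-autoregressive.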

