arXiv preprint - Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

In this episode we discuss Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts by Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, and Zhou Zhao. The paper presents Mega-TTS 2, a text-to-speech model that can synthesize speech for unseen speakers from speech prompts of arbitrary length. Previous models relied on short prompts, which limited how well they could imitate a speaker's natural speaking style. Mega-TTS 2 addresses this by introducing a timbre encoder and a prosody language model. The model also accepts arbitrary-source prompts for finer prosody control and uses a phoneme-level duration model with in-context learning. Experimental results show that Mega-TTS 2 can synthesize identity-preserving speech with both short and long prompts.
