Microsoft Unveils VibeVoice: Open-Source AI Generates 90-Minute, Multi-Speaker Podcasts with Natural Expressions and Cross-Lingual Capabilities

VibeVoice: A Frontier Open-Source Text-to-Speech Model

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokeniz...
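
To put the 7.5 Hz frame rate in perspective, the short sketch below works out how many tokenizer frames a 90-minute recording would need. The 7.5 Hz and 90-minute figures come from the text above; the 50 Hz comparison rate is a hypothetical value chosen only for illustration and is not a claim about any specific codec. The helper name `frames_for_duration` is likewise made up for this example.

```python
def frames_for_duration(minutes: float, frame_rate_hz: float) -> int:
    """Return the number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


VIBEVOICE_FRAME_RATE_HZ = 7.5    # frame rate stated in the article
PODCAST_MINUTES = 90             # duration stated in the headline
COMPARISON_FRAME_RATE_HZ = 50.0  # hypothetical higher rate, for illustration only

vibevoice_frames = frames_for_duration(PODCAST_MINUTES, VIBEVOICE_FRAME_RATE_HZ)
comparison_frames = frames_for_duration(PODCAST_MINUTES, COMPARISON_FRAME_RATE_HZ)

print(vibevoice_frames)   # 40500
print(comparison_frames)  # 270000
```

Under these assumptions, a full 90-minute session comes to roughly forty thousand frames at 7.5 Hz, which is why a low frame rate matters for keeping long-form generation tractable.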