Developer Samuel Vitorino releases Sopro, a lightweight 169M-parameter text-to-speech model with zero-shot voice cloning, trained on a single GPU; it runs 4x faster than real time on CPU.

GitHub - samuel-vitorino/sopro: A lightweight text-to-speech model with zero-shot voice cloning

Sopro TTS

Sopro (from the Portuguese word for “breath/blow”) is a lightweight English text-to-speech model I trained as a side project. Sopro is composed of dilated convs (à la WaveNet) and lightweight cross-attention layers, instead of the common Transformer architecture. Even though Sopro is not SOTA across most voices and situations, I still think it’s a cool project made with a very low budget (trained on a single L40S GPU), and it can be improved with better data. Some of t...
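
For readers unfamiliar with WaveNet-style dilated convolutions, the following is a minimal pure-Python sketch of the general idea, not Sopro's actual code. The kernel size, the dilation schedule, and the function names `receptive_field` and `dilated_causal_conv1d` are all assumptions chosen for illustration; the README excerpt does not state the model's real configuration. It shows how spacing the filter taps apart lets a small stack of layers cover a long stretch of input.

```python
def receptive_field(kernel_size, dilations):
    """Number of input samples that influence one output of a conv stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


def dilated_causal_conv1d(signal, weights, dilation):
    """Causal 1-D convolution whose taps are `dilation` samples apart.

    Positions before the start of the signal are treated as zeros.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # look back k * dilation samples
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out


# Hypothetical stack (assumption): kernel size 3, dilation doubling per layer.
dilations = [1, 2, 4, 8, 16, 32]
print(receptive_field(3, dilations))  # 1 + 2 * 63 = 127

# An impulse through one layer shows the taps spaced `dilation` apart.
impulse = [1.0] + [0.0] * 7
print(dilated_causal_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2))
# [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

With these assumed settings, six layers of three taps each see 127 input samples, whereas six undilated layers of the same size would see only 13. That growth in context per layer is the usual reason for choosing dilated convolutions in audio models.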