🐸Coqui-AI/TTS: ultra fast voice generation and cloning from multilingual text

A few months ago I covered the TorToiSe-TTS repo, which made text-to-speech generation easy, although it only worked with English models.

https://theroamingworkshop.cloud/b/en/2083/%f0%9f%90%a2tortoise-tts-ai-text-to-speech-generation/

But the AI world moves so fast that today I'm bringing an evolution that far exceeds the previous post: Coqui-AI TTS, with complex multilingual voice generation and cloning in a matter of seconds.

https://github.com/coqui-ai/TTS

Web version

If you're in a rush and want to skip the setup, you can use the free Hugging Face space and get your cloned voice in a few seconds:

https://huggingface.co/spaces/coqui/xtts

Write the text to be generated
Select language
Upload your reference file
Configure the other options (tick the boxes: Cleanup Reference Voice, Do not use language auto-detect, Agree)
Request cloning to the server (Send)

Installation

Another strength of Coqui-AI TTS is the almost instant installation:

You'll need Python >= 3.9, < 3.12.
RAM: not as much as for image generation. 4GB should be enough.
Create a project folder, for example "text-2-speech". Using a Linux terminal:
mkdir text-2-speech
It's convenient to create a dedicated Python environment to avoid package incompatibilities, so you need python3-venv. I'll create an environment called TTSenv:
cd text-2-speech
python3 -m venv TTSenv
Activate the environment in the terminal:
source TTSenv/bin/activate
If you only need voice generation (without cloning or training), install TTS directly from PyPI with pip:
pip install TTS
Otherwise, install the full repo from the Coqui-AI TTS GitHub:
git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .[all]
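Before installing, it can save time to confirm that the interpreter in your environment falls inside the supported range. The following is a minimal sketch using only the standard library. It assumes the lower bound 3.9 is inclusive and the upper bound 3.12 is exclusive, and the function names are my own, not part of Coqui-AI TTS.

```python
import sys

# Assumed supported range: Python >= 3.9 and < 3.12 (lower bound taken as inclusive).
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)


def python_is_supported(version=None):
    """Return True if (major, minor) falls inside the assumed supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) < MAX_VERSION_EXCLUSIVE


def in_virtualenv():
    """Return True when running inside a venv (prefix differs from base prefix)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "supported:", python_is_supported())
    print("Inside a virtual environment:", in_virtualenv())
```

Run it with the environment activated. If the second line prints False, the packages would be installed system-wide instead of inside TTSenv.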

Checking language models and voices

The first thing you can do is check the models available to turn text into speech in different languages.

Type the following in your terminal:

tts --list_models

No API token found for 🐸Coqui Studio voices - https://coqui.ai
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=`

Name format: type/language/dataset/model
1: tts_models/multilingual/multi-dataset/xtts_v2 [already downloaded]
2: tts_models/multilingual/multi-dataset/xtts_v1.1 [already downloaded]
3: tts_models/multilingual/multi-dataset/your_tts
4: tts_models/multilingual/multi-dataset/bark [already downloaded]
5: tts_models/bg/cv/vits
6: tts_models/cs/cv/vits
7: tts_models/da/cv/vits
8: tts_models/et/cv/vits
9: tts_models/ga/cv/vits
10: tts_models/en/ek1/tacotron2
11: tts_models/en/ljspeech/tacotron2-DDC
12: tts_models/en/ljspeech/tacotron2-DDC_ph
13: tts_models/en/ljspeech/glow-tts
14: tts_models/en/ljspeech/speedy-speech
15: tts_models/en/ljspeech/tacotron2-DCA
16: tts_models/en/ljspeech/vits
17: tts_models/en/ljspeech/vits--neon
18: tts_models/en/ljspeech/fast_pitch
19: tts_models/en/ljspeech/overflow
20: tts_models/en/ljspeech/neural_hmm
21: tts_models/en/vctk/vits
22: tts_models/en/vctk/fast_pitch
23: tts_models/en/sam/tacotron-DDC
24: tts_models/en/blizzard2013/capacitron-t2-c50
25: tts_models/en/blizzard2013/capacitron-t2-c150_v2
26: tts_models/en/multi-dataset/tortoise-v2
27: tts_models/en/jenny/jenny
28: tts_models/es/mai/tacotron2-DDC [already downloaded]
29: tts_models/es/css10/vits [already downloaded]
30: tts_models/fr/mai/tacotron2-DDC
31: tts_models/fr/css10/vits
32: tts_models/uk/mai/glow-tts
33: tts_models/uk/mai/vits
34: tts_models/zh-CN/baker/tacotron2-DDC-GST
35: tts_models/nl/mai/tacotron2-DDC
36: tts_models/nl/css10/vits
37: tts_models/de/thorsten/tacotron2-DCA
38: tts_models/de/thorsten/vits
39: tts_models/de/thorsten/tacotron2-DDC
40: tts_models/de/css10/vits-neon
41: tts_models/ja/kokoro/tacotron2-DDC
42: tts_models/tr/common-voice/glow-tts
43: tts_models/it/mai_female/glow-tts
44: tts_models/it/mai_female/vits
45: tts_models/it/mai_male/glow-tts
46: tts_models/it/mai_male/vits
47: tts_models/ewe/openbible/vits
48: tts_models/hau/openbible/vits
49: tts_models/lin/openbible/vits
50: tts_models/tw_akuapem/openbible/vits
51: tts_models/tw_asante/openbible/vits
52: tts_models/yor/openbible/vits
53: tts_models/hu/css10/vits
54: tts_models/el/cv/vits
55: tts_models/fi/css10/vits
56: tts_models/hr/cv/vits
57: tts_models/lt/cv/vits
58: tts_models/lv/cv/vits
59: tts_models/mt/cv/vits
60: tts_models/pl/mai_female/vits
61: tts_models/pt/cv/vits
62: tts_models/ro/cv/vits
63: tts_models/sk/cv/vits
64: tts_models/sl/cv/vits
65: tts_models/sv/cv/vits
66: tts_models/ca/custom/vits
67: tts_models/fa/custom/glow-tts
68: tts_models/bn/custom/vits-male
69: tts_models/bn/custom/vits-female
70: tts_models/be/common-voice/glow-tts

Name format: type/language/dataset/model
1: vocoder_models/universal/libri-tts/wavegrad
2: vocoder_models/universal/libri-tts/fullband-melgan [already downloaded]
3: vocoder_models/en/ek1/wavegrad
4: vocoder_models/en/ljspeech/multiband-melgan
5: vocoder_models/en/ljspeech/hifigan_v2
6: vocoder_models/en/ljspeech/univnet
7: vocoder_models/en/blizzard2013/hifigan_v2
8: vocoder_models/en/vctk/hifigan_v2
9: vocoder_models/en/sam/hifigan_v2
10: vocoder_models/nl/mai/parallel-wavegan
11: vocoder_models/de/thorsten/wavegrad
12: vocoder_models/de/thorsten/fullband-melgan
13: vocoder_models/de/thorsten/hifigan_v1
14: vocoder_models/ja/kokoro/hifigan_v1
15: vocoder_models/uk/mai/multiband-melgan
16: vocoder_models/tr/common-voice/hifigan
17: vocoder_models/be/common-voice/hifigan

Name format: type/language/dataset/model
1: voice_conversion_models/multilingual/vctk/freevc24 [already downloaded]

Or filter the result with grep, for example to get the Spanish models:

tts --list_models | grep "/es"

28: tts_models/es/mai/tacotron2-DDC [already downloaded]
29: tts_models/es/css10/vits [already downloaded]
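The grep filter matches any name that contains "/es", so a stricter option is to match the language field of the "type/language/dataset/model" name format exactly. Here is a small sketch in plain Python, assuming the listing has one numbered entry per line as the CLI prints it. The function names and the sample text are mine, for illustration only.

```python
def parse_model_names(listing):
    """Extract 'type/language/dataset/model' names from `tts --list_models` text."""
    names = []
    for line in listing.splitlines():
        # Entries look like ' 28: tts_models/es/mai/tacotron2-DDC [already downloaded]'
        index, sep, rest = line.strip().partition(": ")
        if sep and index.isdigit() and rest:
            names.append(rest.split()[0])
    return names


def filter_by_language(names, language):
    """Keep names whose second path component equals the language code."""
    selected = []
    for name in names:
        parts = name.split("/")
        if len(parts) == 4 and parts[1] == language:
            selected.append(name)
    return selected


# Short excerpt of the listing, used here as sample input.
sample_listing = """
 Name format: type/language/dataset/model
 28: tts_models/es/mai/tacotron2-DDC [already downloaded]
 29: tts_models/es/css10/vits [already downloaded]
 30: tts_models/fr/mai/tacotron2-DDC
 40: tts_models/de/css10/vits-neon
"""

spanish = filter_by_language(parse_model_names(sample_listing), "es")
print(spanish)
```

To use it on the real listing, you could save the command output to a text file and pass its contents to parse_model_names.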

Text to speech

With all this you're ready to turn text into speech in a matter of seconds and in the language of your choice.

In the same terminal, run the following, specifying the right model name:

tts --text "Ahora puedo hablar en español!" --model_name "tts_models/es/css10/vits" --out_path output/tts-es.wav

Make sure that the output folder exists, then check your result. The first time, several files will be downloaded and you'll have to accept the Coqui-AI license. After that, voice generation only takes a few seconds.
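If you'd rather not create the output folder by hand, the same command can be prepared from Python. This is only a sketch: the helper name build_tts_command is hypothetical, and it simply wraps the --text, --model_name and --out_path flags used above, creating the parent folder first.

```python
import subprocess
from pathlib import Path


def build_tts_command(text, model_name, out_path):
    """Create the output folder if needed and return the argv list for the tts CLI."""
    out_file = Path(out_path)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    return ["tts", "--text", text, "--model_name", model_name, "--out_path", str(out_file)]


if __name__ == "__main__":
    command = build_tts_command(
        "Ahora puedo hablar en español!",
        "tts_models/es/css10/vits",
        "output/tts-es.wav",
    )
    print(command)
    # Uncomment to run the synthesis (needs the tts command from the install step):
    # subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues with spaces or exclamation marks in the text.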

Voice cloning

Lastly, the most amazing feature of this model is voice cloning from only a few seconds of recorded audio.

Like in the previous post, I took some 30 seconds of Ultron's voice from the film Avengers: Age of Ultron.

Sample in Spanish:

Sample in English:

Now, let's prepare a Python script that sets all the needed parameters. It will do the following:

Import torch and TTS
import torch
from TTS.api import TTS
Define the device (cuda or cpu). Using cpu should be enough (cuda may well crash).
device="cpu"
Define text to be generated.
txt="Voice generated from text"
Define the reference audio sample (a .wav file of about 30 seconds)
sample="/voice-folder/voice.wav"
Load the TTS model
tts1=TTS("model_name").to(device)
Generate the output file
tts1.tts_to_file(txt, speaker_wav=sample, language="es", file_path="output-folder/output-file.wav")
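For the device step above, here is one possible way to keep cpu as the default and only switch to cuda when you explicitly ask for it and torch reports it as available. The helper name pick_device is my own and is not part of TTS. It relies on torch.cuda.is_available() and falls back to cpu if torch is not installed.

```python
def pick_device(prefer_cuda=False):
    """Return 'cuda' only when requested and available; otherwise 'cpu'."""
    if not prefer_cuda:
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()  # defaults to cpu, which should be enough here
print(device)
```

Calling pick_device(prefer_cuda=True) is the opt-in path if you want to try the GPU.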

I saved the script as TRW-clone.py, and it looks like this:

import torch
from TTS.api import TTS

# Get device ('cuda' or 'cpu')
device="cpu"

#Define text
txt="Bienvenido a este nuevo artículo del blog. Disfruta de tu visita."
#txt="Welcome to this new blog post... Enjoy your visit!"

#Define audio sample
sample="../my-voices/ultron-es/mix.wav"
#sample="../my-voices/ultron-en/mix.wav"

#Run cloning
tts1 = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts1.tts_to_file(txt, speaker_wav=sample, language="es", file_path="../output/ultron-es.wav")

Run it from the TTS folder where the repo was installed:

cd TTS
python3 TRW-clone.py
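If the script complains about the reference sample, a quick check of the file can help. The sketch below uses only Python's standard wave module, which reads uncompressed PCM .wav files, to measure the duration. The 5 and 60 second bounds are illustrative assumptions of mine, not limits documented by Coqui-AI TTS, and the function names are hypothetical.

```python
import sys
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_sample(path, min_seconds=5.0, max_seconds=60.0):
    """Return (ok, duration); the bounds are illustrative, not library limits."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration


if __name__ == "__main__" and len(sys.argv) > 1:
    ok, duration = check_reference_sample(sys.argv[1])
    print(f"{sys.argv[1]}: {duration:.1f} s, within assumed bounds: {ok}")
```

Save it under any name you like and pass the path of the sample as the first argument.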

Results

Here are the results I got from my first tests.

Spanish:

English:

And with a couple of iterations you can get really amazing results.

If you have any questions or comments, you can drop me a line on Twitter/X.

🐦 @RoamingWorkshop