A few months ago I wrote about the TorToiSe-TTS repo, which made text-to-speech generation easy, although it only worked with English models.

https://theroamingworkshop.cloud/b/en/2083/%f0%9f%90%a2tortoise-tts-ai-text-to-speech-generation/

But the AI world moves so fast that today I’m bringing an evolution that completely surpasses the previous post, with complex multilingual voice generation and cloning in a matter of seconds: Coqui-AI TTS.

https://github.com/coqui-ai/TTS

Web version

If you're in a rush and don't want any trouble, you can use the free Hugging Face space and get your cloned voice in a few seconds:

https://huggingface.co/spaces/coqui/xtts

  1. Write the text to be generated
  2. Select language
  3. Upload your reference file
  4. Configure the other options (tick the boxes: Cleanup Reference Voice, Do not use language auto-detect, Agree)
  5. Send the cloning request to the server (Send)

Installation

Another strength of Coqui-AI TTS is the almost instant installation:

  • You'll need Python >= 3.9, < 3.12.
  • RAM: not as much as for image generation. 4GB should be enough.
  • Create a project folder, for example "text-2-speech". Using a Linux terminal:
    mkdir text-2-speech
  • It's convenient to create a dedicated Python environment to avoid package incompatibilities, so you need python3-venv. I'll create an environment called TTSenv:
    cd text-2-speech
    python3 -m venv TTSenv
  • Activate the environment in the terminal:
    source TTSenv/bin/activate
  • If you only need voice generation (without cloning or training), install TTS directly with pip:
    pip install TTS
  • Otherwise, install the full repo from the Coqui-AI TTS GitHub:
    git clone https://github.com/coqui-ai/TTS
    cd TTS
    pip install -e .[all]
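
Before installing, it can save time to confirm that the interpreter inside the environment falls in the supported range. The snippet below is only a small sketch of that check. The function name is my own, and treating 3.9 itself as accepted is an assumption on my side.

```python
import sys

# Supported range as described above. Counting 3.9 itself as valid
# is an assumption; 3.12 and newer are excluded.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)


def python_version_supported(version=None):
    """Return True if (major, minor) falls inside the supported range."""
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_version_supported():
        print(f"Python {current} is inside the supported range.")
    else:
        print(f"Python {current} is outside the supported range.")
```

Run it with the python3 of the activated TTSenv environment, since that is the interpreter pip will install into.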

Checking language models and voices

The first thing you can do is check the models available to turn text into voice in different languages.

Type the following in your terminal:

tts --list_models

No API token found for 🐸Coqui Studio voices - https://coqui.ai
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=`


Name format: type/language/dataset/model
1: tts_models/multilingual/multi-dataset/xtts_v2 [already downloaded]
2: tts_models/multilingual/multi-dataset/xtts_v1.1 [already downloaded]
3: tts_models/multilingual/multi-dataset/your_tts
4: tts_models/multilingual/multi-dataset/bark [already downloaded]
5: tts_models/bg/cv/vits
6: tts_models/cs/cv/vits
7: tts_models/da/cv/vits
8: tts_models/et/cv/vits
9: tts_models/ga/cv/vits
10: tts_models/en/ek1/tacotron2
11: tts_models/en/ljspeech/tacotron2-DDC
12: tts_models/en/ljspeech/tacotron2-DDC_ph
13: tts_models/en/ljspeech/glow-tts
14: tts_models/en/ljspeech/speedy-speech
15: tts_models/en/ljspeech/tacotron2-DCA
16: tts_models/en/ljspeech/vits
17: tts_models/en/ljspeech/vits--neon
18: tts_models/en/ljspeech/fast_pitch
19: tts_models/en/ljspeech/overflow
20: tts_models/en/ljspeech/neural_hmm
21: tts_models/en/vctk/vits
22: tts_models/en/vctk/fast_pitch
23: tts_models/en/sam/tacotron-DDC
24: tts_models/en/blizzard2013/capacitron-t2-c50
25: tts_models/en/blizzard2013/capacitron-t2-c150_v2
26: tts_models/en/multi-dataset/tortoise-v2
27: tts_models/en/jenny/jenny
28: tts_models/es/mai/tacotron2-DDC [already downloaded]
29: tts_models/es/css10/vits [already downloaded]
30: tts_models/fr/mai/tacotron2-DDC
31: tts_models/fr/css10/vits
32: tts_models/uk/mai/glow-tts
33: tts_models/uk/mai/vits
34: tts_models/zh-CN/baker/tacotron2-DDC-GST
35: tts_models/nl/mai/tacotron2-DDC
36: tts_models/nl/css10/vits
37: tts_models/de/thorsten/tacotron2-DCA
38: tts_models/de/thorsten/vits
39: tts_models/de/thorsten/tacotron2-DDC
40: tts_models/de/css10/vits-neon
41: tts_models/ja/kokoro/tacotron2-DDC
42: tts_models/tr/common-voice/glow-tts
43: tts_models/it/mai_female/glow-tts
44: tts_models/it/mai_female/vits
45: tts_models/it/mai_male/glow-tts
46: tts_models/it/mai_male/vits
47: tts_models/ewe/openbible/vits
48: tts_models/hau/openbible/vits
49: tts_models/lin/openbible/vits
50: tts_models/tw_akuapem/openbible/vits
51: tts_models/tw_asante/openbible/vits
52: tts_models/yor/openbible/vits
53: tts_models/hu/css10/vits
54: tts_models/el/cv/vits
55: tts_models/fi/css10/vits
56: tts_models/hr/cv/vits
57: tts_models/lt/cv/vits
58: tts_models/lv/cv/vits
59: tts_models/mt/cv/vits
60: tts_models/pl/mai_female/vits
61: tts_models/pt/cv/vits
62: tts_models/ro/cv/vits
63: tts_models/sk/cv/vits
64: tts_models/sl/cv/vits
65: tts_models/sv/cv/vits
66: tts_models/ca/custom/vits
67: tts_models/fa/custom/glow-tts
68: tts_models/bn/custom/vits-male
69: tts_models/bn/custom/vits-female
70: tts_models/be/common-voice/glow-tts

Name format: type/language/dataset/model
1: vocoder_models/universal/libri-tts/wavegrad
2: vocoder_models/universal/libri-tts/fullband-melgan [already downloaded]
3: vocoder_models/en/ek1/wavegrad
4: vocoder_models/en/ljspeech/multiband-melgan
5: vocoder_models/en/ljspeech/hifigan_v2
6: vocoder_models/en/ljspeech/univnet
7: vocoder_models/en/blizzard2013/hifigan_v2
8: vocoder_models/en/vctk/hifigan_v2
9: vocoder_models/en/sam/hifigan_v2
10: vocoder_models/nl/mai/parallel-wavegan
11: vocoder_models/de/thorsten/wavegrad
12: vocoder_models/de/thorsten/fullband-melgan
13: vocoder_models/de/thorsten/hifigan_v1
14: vocoder_models/ja/kokoro/hifigan_v1
15: vocoder_models/uk/mai/multiband-melgan
16: vocoder_models/tr/common-voice/hifigan
17: vocoder_models/be/common-voice/hifigan
Name format: type/language/dataset/model
1: voice_conversion_models/multilingual/vctk/freevc24 [already downloaded]

Or filter the result with grep, for example to get the Spanish models:

tts --list_models | grep "/es"

28: tts_models/es/mai/tacotron2-DDC [already downloaded]
29: tts_models/es/css10/vits [already downloaded]
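
Keep in mind that grep matches the text anywhere in the line. If you'd rather compare the language field exactly, the listing can also be parsed in Python. This is just a sketch that relies on the "type/language/dataset/model" name format printed above; the function name and the short sample listing are mine, for illustration only.

```python
import re

# Matches lines such as "28: tts_models/es/mai/tacotron2-DDC [already downloaded]"
MODEL_LINE = re.compile(r"^\s*\d+:\s+(\S+)")


def models_for_language(listing, language, model_type="tts_models"):
    """Return model names of the given type whose language field matches exactly."""
    matches = []
    for line in listing.splitlines():
        found = MODEL_LINE.match(line)
        if not found:
            continue  # skip banners, blank lines and "Name format" headers
        parts = found.group(1).split("/")
        # Name format: type/language/dataset/model
        if len(parts) == 4 and parts[0] == model_type and parts[1] == language:
            matches.append(found.group(1))
    return matches


# Short excerpt of the listing above, used here as sample input.
SAMPLE = """\
Name format: type/language/dataset/model
8: tts_models/et/cv/vits
28: tts_models/es/mai/tacotron2-DDC [already downloaded]
29: tts_models/es/css10/vits [already downloaded]
5: vocoder_models/en/ljspeech/hifigan_v2
"""

print(models_for_language(SAMPLE, "es"))
```

To use it on the real listing, save the output of the command to a text file first and pass the file contents as the listing argument.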

Text to speech

With all this you're ready to turn text into speech in a matter of seconds and in the language of your choice.

In the same terminal, run the following, specifying the right model name:

tts --text "Ahora puedo hablar en español!" --model_name "tts_models/es/css10/vits" --out_path output/tts-es.wav

Make sure the output folder exists before running the command, then check the result. On the first run, several model files are downloaded and you'll have to accept the Coqui-AI license. After that, voice generation takes only a few seconds.
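
If you launch the command from a script, the output folder can be created automatically. The helper below is a minimal sketch of that idea: it only uses the three flags from the command above (--text, --model_name, --out_path), and the function name is hypothetical.

```python
from pathlib import Path


def build_tts_command(text, model_name, out_path):
    """Create the output folder if needed and return the tts argument list."""
    out_file = Path(out_path)
    # Equivalent to "mkdir -p" on the folder that will hold the .wav file.
    out_file.parent.mkdir(parents=True, exist_ok=True)
    return [
        "tts",
        "--text", text,
        "--model_name", model_name,
        "--out_path", str(out_file),
    ]


# Example (requires TTS to be installed in the active environment):
# import subprocess
# subprocess.run(
#     build_tts_command(
#         "Ahora puedo hablar en español!",
#         "tts_models/es/css10/vits",
#         "output/tts-es.wav",
#     ),
#     check=True,
# )
```

Passing the arguments as a list avoids shell quoting problems with accents or exclamation marks in the text.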

Voice cloning

Lastly, the most amazing feature of this model is voice cloning from only a few seconds of recorded audio.

Like in the previous post, I took some 30 seconds of Ultron's voice from the film Avengers: Age of Ultron.

Sample in Spanish:

Sample in English:

Now, let's prepare a Python script that sets all the needed parameters. It will do the following:

  • Import torch and TTS
    import torch
    from TTS.api import TTS
  • Define the compute device (cuda or cpu). Using cpu should be enough (cuda might crash).
    device="cpu"
  • Define text to be generated.
    txt="Voice generated from text"
  • Define the reference audio sample (a .wav file of about 30 seconds)
    sample="/voice-folder/voice.wav"
  • Call to TTS model
    tts1=TTS("model_name").to(device)
  • File creation
    tts1.tts_to_file(txt, speaker_wav=sample, language="es", file_path="output-folder/output-file.wav")
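
Since the reference sample should be a .wav file of about 30 seconds, a quick length check before cloning can avoid surprises. Here is a rough sketch using only Python's standard wave module. It assumes an uncompressed PCM .wav file, and the 5 and 60 second bounds are arbitrary values of mine, not limits set by the model.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
    return frames / float(rate)


def check_reference_sample(path, min_seconds=5.0, max_seconds=60.0):
    """Return (ok, duration). The default bounds are hypothetical, chosen
    around the roughly 30 seconds suggested above."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds, duration


# Example:
# ok, seconds = check_reference_sample("/voice-folder/voice.wav")
# print(f"{seconds:.1f} s, usable: {ok}")
```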

I saved the full script as TRW-clone.py, and it looks like this:

import torch
from TTS.api import TTS

# Get device ('cuda' or 'cpu')
device="cpu"

#Define text
txt="Bienvenido a este nuevo artículo del blog. Disfruta de tu visita."
#txt="Welcome to this new blog post... Enjoy your visit!"

#Define audio sample
sample="../my-voices/ultron-es/mix.wav"
#sample="../my-voices/ultron-en/mix.wav"

#Run cloning
tts1 = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts1.tts_to_file(txt, speaker_wav=sample, language="es", file_path="../output/ultron-es.wav")

Run it from the TTS folder where the repo was installed:

cd TTS
python3 TRW-clone.py
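
Instead of commenting lines in and out to switch between languages, both versions can be generated in one run. This is only a sketch: it reuses the same tts_to_file call as the script above, the clone_batch helper is hypothetical, and the English output path is an assumption that mirrors the Spanish one.

```python
def clone_batch(tts, jobs):
    """Run one tts_to_file call per job and return the written file paths."""
    outputs = []
    for job in jobs:
        # Same call as in TRW-clone.py, once per language.
        tts.tts_to_file(
            job["text"],
            speaker_wav=job["sample"],
            language=job["language"],
            file_path=job["out"],
        )
        outputs.append(job["out"])
    return outputs


JOBS = [
    {
        "text": "Bienvenido a este nuevo artículo del blog. Disfruta de tu visita.",
        "sample": "../my-voices/ultron-es/mix.wav",
        "language": "es",
        "out": "../output/ultron-es.wav",
    },
    {
        "text": "Welcome to this new blog post... Enjoy your visit!",
        "sample": "../my-voices/ultron-en/mix.wav",
        "language": "en",
        "out": "../output/ultron-en.wav",  # assumed path, mirrors the Spanish one
    },
]

# Example (loads the model once, then generates both files):
# from TTS.api import TTS
# tts1 = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
# clone_batch(tts1, JOBS)
```

Loading the model once and looping over the jobs avoids paying the model start-up time for every language.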

Results

Here are the results I got from my first tests.

Spanish:

English:

And with a couple of iterations you can get really amazing results.

If you have any questions or comments, you can drop me a line on Twitter/X

🐦 @RoamingWorkshop