A few months ago I brought TorToiSe-TTS repo, which made it easy to generate text-to-speech although it only worked with english models.
https://theroamingworkshop.cloud/b/en/2083/%f0%9f%90%a2tortoise-tts-ai-text-to-speech-generation/
But AI world is moving so fast that today I’m bringing an evolution that completely exceeds the previous post, with complex voice generation and cloning in a matter of seconds and multilingual: Coqui-AI TTS.
https://github.com/coqui-ai/TTS
Web version
If you're in a rush and don't want trouble, you can use the free huggingface space and get your cloned voice in a few seconds:
https://huggingface.co/spaces/coqui/xtts
- Write the text to be generated
- Select language
- Upload your reference file
- Configure the other options (tick the boxes: Cleanup Reference Voice, Do not use language auto-detect, Agree)
- Request cloning to the server (Send)
Installation
Another strength of Coqui-AI TTS is the almost instant installation:
- You'll need python > 3.9, < 3.12.
- RAM: not as much as for image generation. 4GB should be enough.
- Create a project folder, for example "text-2-speech". Using a Linux terminal:
mkdir text-2-speech - It's convenient to create a specific python environment to avoid package incompatibilities, so you need python3-venv. I'll create an environemtn called TTSenv:
cd text-2-speech
python3 -m venv TTSenv - Activate the environment in the terminal:
source TTSenv/bin/activate - If you only need voice generation (without cloning or training), install TTS directly with python:
pip install TTS - Otherwise, install the full repo from Coqui-AI TTS github:
git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .[all]
Checking language models and voices
First thing you can do is to check the available models to transform text into voice in different languages.
Type the following in your terminal:
tts --list_models
Or filter the result with grep, for example to get spanish models:
tts --list_models | grep "/es"
Text to speech
With all this you're ready to turn text into speech in a matter of seconds and in the language of your choice.
In the previous terminal, write the following, specifying the right model name:
tts --text "Ahora puedo hablar en español!" --model_name "tts_models/es/css10/vits" --out_path output/tts-es.wav
Make sure that the output folder exists, then check your result. The first time you'll get several files downloaded, and you'll have to accept Coqui-AI license. Next, voice generation only takes a few seconds:
Voice cloning
Lastly, the most amazing feature of this model is the voice cloning from only a few seconds of audio recording.
Like in the previous post, I took some 30 seconds of Ultron's voice from the film Avengers: Age of Ultron.
Sample in spanish:
Sample in english:
Now, let's prepare a python script to set all needed parameters, which will do the following:
- Import torch and TTS
import torch
from TTS.api import TTS - Define memory device (cuda or cpu). Using cpu should be enough (cuda might probably crash).
device="cpu" - Define text to be generated.
txt="Voice generated from text" - Define the reference audio sample (a .wav file of about 30 seconds)
sample="/voice-folder/voice.wav" - Call to TTS model
tts1=TTS("model_name").to(device) - File creation
tts1.tts_to_file(txt, speaker_wav=sample, language="es", file_path="output-folder/output-file.wav")
I called a script TRW-clone.py looking like this:
import torch
from TTS.api import TTS
# Get device ('cuda' or 'cpu')
device="cpu"
#Define text
txt="Bienvenido a este nuevo artículo del blog. Disfruta de tu visita."
#txt="Welcome to this new block post... Enjoy your visit!"
#Define audio sample
sample="../my-voices/ultron-es/mix.wav"
#sample="../my-voices/ultron-en/mix.wav"
#Run cloning
tts1 = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts1.tts_to_file(txt, speaker_wav=sample, language="es", file_path="../output/ultron-es.wav")
Run it from the TTS folder where the repo was installed:
cd TTS
python3 TRW-clone.py
Results
Here I drop the results I got on my first tests.
Spanish:
English:
And with a couple of iterations you can get really amazing results.
Any doubts or comments you can still drop me a line on Twitter/X