
🐸Coqui-AI/TTS: ultra fast voice generation and cloning from multilingual text

19/11/2023 / TRW

A few months ago I wrote about the TorToiSe-TTS repo, which made text-to-speech generation easy, although it only worked with English models.

https://theroamingworkshop.cloud/b/en/2083/%f0%9f%90%a2tortoise-tts-ai-text-to-speech-generation/

But the AI world is moving so fast that today I’m bringing an evolution that completely surpasses the previous post: Coqui-AI TTS, with complex, multilingual voice generation and cloning in a matter of seconds.

https://github.com/coqui-ai/TTS

Web version

If you're in a rush and don't want any trouble, you can use the free Hugging Face space and get your cloned voice in a few seconds:

https://huggingface.co/spaces/coqui/xtts

Write the text to be generated
Select language
Upload your reference file
Configure the other options (tick the boxes: Cleanup Reference Voice, Do not use language auto-detect, Agree)
Request cloning to the server (Send)

Installation

Another strength of Coqui-AI TTS is the almost instant installation:

You'll need Python >= 3.9, < 3.12.
RAM: not as much as for image generation. 4GB should be enough.
Create a project folder, for example "text-2-speech". Using a Linux terminal:
mkdir text-2-speech
It's convenient to create a dedicated Python environment to avoid package incompatibilities, so you need python3-venv. I'll create an environment called TTSenv:
cd text-2-speech
python3 -m venv TTSenv
Activate the environment in the terminal:
source TTSenv/bin/activate
If you only need voice generation (without cloning or training), install TTS directly with pip:
pip install TTS
Otherwise, install the full repo from Coqui-AI TTS github:
git clone https://github.com/coqui-ai/TTS
cd TTS
pip install -e .[all]
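
Before creating the environment, it can save time to confirm that your interpreter falls in the supported range. This is only a minimal sketch: the function name is my own, and it assumes the lower bound is inclusive (3.9 itself is accepted) and the upper bound is exclusive.

```python
import sys

# Supported range from the installation notes: 3.9 up to, but not including, 3.12.
# Assumption: 3.9 itself is accepted.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)


def python_version_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


current = "%d.%d" % tuple(sys.version_info[:2])
if python_version_supported():
    print("Python", current, "should work with Coqui-AI TTS")
else:
    print("Python", current, "is outside the supported range")
```

Run it with the same python3 you plan to use for the virtual environment.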

Checking language models and voices

The first thing you can do is check the available models for turning text into speech in different languages.

Type the following in your terminal:

tts --list_models

No API token found for 🐸Coqui Studio voices - https://coqui.ai
Visit 🔗https://app.coqui.ai/account to get one.
Set it as an environment variable `export COQUI_STUDIO_TOKEN=`

Name format: type/language/dataset/model
1: tts_models/multilingual/multi-dataset/xtts_v2 [already downloaded]
2: tts_models/multilingual/multi-dataset/xtts_v1.1 [already downloaded]
3: tts_models/multilingual/multi-dataset/your_tts
4: tts_models/multilingual/multi-dataset/bark [already downloaded]
5: tts_models/bg/cv/vits
6: tts_models/cs/cv/vits
7: tts_models/da/cv/vits
8: tts_models/et/cv/vits
9: tts_models/ga/cv/vits
10: tts_models/en/ek1/tacotron2
11: tts_models/en/ljspeech/tacotron2-DDC
12: tts_models/en/ljspeech/tacotron2-DDC_ph
13: tts_models/en/ljspeech/glow-tts
14: tts_models/en/ljspeech/speedy-speech
15: tts_models/en/ljspeech/tacotron2-DCA
16: tts_models/en/ljspeech/vits
17: tts_models/en/ljspeech/vits--neon
18: tts_models/en/ljspeech/fast_pitch
19: tts_models/en/ljspeech/overflow
20: tts_models/en/ljspeech/neural_hmm
21: tts_models/en/vctk/vits
22: tts_models/en/vctk/fast_pitch
23: tts_models/en/sam/tacotron-DDC
24: tts_models/en/blizzard2013/capacitron-t2-c50
25: tts_models/en/blizzard2013/capacitron-t2-c150_v2
26: tts_models/en/multi-dataset/tortoise-v2
27: tts_models/en/jenny/jenny
28: tts_models/es/mai/tacotron2-DDC [already downloaded]
29: tts_models/es/css10/vits [already downloaded]
30: tts_models/fr/mai/tacotron2-DDC
31: tts_models/fr/css10/vits
32: tts_models/uk/mai/glow-tts
33: tts_models/uk/mai/vits
34: tts_models/zh-CN/baker/tacotron2-DDC-GST
35: tts_models/nl/mai/tacotron2-DDC
36: tts_models/nl/css10/vits
37: tts_models/de/thorsten/tacotron2-DCA
38: tts_models/de/thorsten/vits
39: tts_models/de/thorsten/tacotron2-DDC
40: tts_models/de/css10/vits-neon
41: tts_models/ja/kokoro/tacotron2-DDC
42: tts_models/tr/common-voice/glow-tts
43: tts_models/it/mai_female/glow-tts
44: tts_models/it/mai_female/vits
45: tts_models/it/mai_male/glow-tts
46: tts_models/it/mai_male/vits
47: tts_models/ewe/openbible/vits
48: tts_models/hau/openbible/vits
49: tts_models/lin/openbible/vits
50: tts_models/tw_akuapem/openbible/vits
51: tts_models/tw_asante/openbible/vits
52: tts_models/yor/openbible/vits
53: tts_models/hu/css10/vits
54: tts_models/el/cv/vits
55: tts_models/fi/css10/vits
56: tts_models/hr/cv/vits
57: tts_models/lt/cv/vits
58: tts_models/lv/cv/vits
59: tts_models/mt/cv/vits
60: tts_models/pl/mai_female/vits
61: tts_models/pt/cv/vits
62: tts_models/ro/cv/vits
63: tts_models/sk/cv/vits
64: tts_models/sl/cv/vits
65: tts_models/sv/cv/vits
66: tts_models/ca/custom/vits
67: tts_models/fa/custom/glow-tts
68: tts_models/bn/custom/vits-male
69: tts_models/bn/custom/vits-female
70: tts_models/be/common-voice/glow-tts

Name format: type/language/dataset/model
1: vocoder_models/universal/libri-tts/wavegrad
2: vocoder_models/universal/libri-tts/fullband-melgan [already downloaded]
3: vocoder_models/en/ek1/wavegrad
4: vocoder_models/en/ljspeech/multiband-melgan
5: vocoder_models/en/ljspeech/hifigan_v2
6: vocoder_models/en/ljspeech/univnet
7: vocoder_models/en/blizzard2013/hifigan_v2
8: vocoder_models/en/vctk/hifigan_v2
9: vocoder_models/en/sam/hifigan_v2
10: vocoder_models/nl/mai/parallel-wavegan
11: vocoder_models/de/thorsten/wavegrad
12: vocoder_models/de/thorsten/fullband-melgan
13: vocoder_models/de/thorsten/hifigan_v1
14: vocoder_models/ja/kokoro/hifigan_v1
15: vocoder_models/uk/mai/multiband-melgan
16: vocoder_models/tr/common-voice/hifigan
17: vocoder_models/be/common-voice/hifigan

Name format: type/language/dataset/model
1: voice_conversion_models/multilingual/vctk/freevc24 [already downloaded]

Or filter the result with grep, for example to get Spanish models:

tts --list_models | grep "/es"

28: tts_models/es/mai/tacotron2-DDC [already downloaded]
29: tts_models/es/css10/vits [already downloaded]
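
If you'd rather filter by the language field itself instead of a text match, the listing is easy to parse. The sketch below is only an illustration: the function name is hypothetical, and it assumes every entry follows the "number: type/language/dataset/model" layout shown in the listing.

```python
import re

# Matches listing lines such as "29: tts_models/es/css10/vits [already downloaded]".
ENTRY = re.compile(r"^\s*\d+:\s+(?P<name>\S+)")


def models_for_language(listing, language):
    """Return the model names whose language field equals `language`."""
    names = []
    for line in listing.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # headers, blank lines and other messages
        parts = match.group("name").split("/")
        # Name format: type/language/dataset/model
        if len(parts) == 4 and parts[1] == language:
            names.append(match.group("name"))
    return names


# A few lines copied from the listing, used here as sample input.
sample_listing = """\
Name format: type/language/dataset/model
1: tts_models/multilingual/multi-dataset/xtts_v2 [already downloaded]
28: tts_models/es/mai/tacotron2-DDC [already downloaded]
29: tts_models/es/css10/vits [already downloaded]
30: tts_models/fr/mai/tacotron2-DDC
16: vocoder_models/tr/common-voice/hifigan
"""

print(models_for_language(sample_listing, "es"))
```

To use the real listing, capture the output of tts --list_models (for example with subprocess.run and capture_output=True) and pass it in place of sample_listing.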

Text to speech

With all this you're ready to turn text into speech in a matter of seconds and in the language of your choice.

In the previous terminal, write the following, specifying the right model name:

tts --text "Ahora puedo hablar en español!" --model_name "tts_models/es/css10/vits" --out_path output/tts-es.wav

Make sure that the output folder exists, then check your result. The first time you run it, several files will be downloaded and you'll have to accept the Coqui-AI license. After that, voice generation only takes a few seconds.
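
If you call the command from a script, the output folder can be created for you. This is a rough sketch under my own assumptions: both function names are hypothetical, and it only uses the three flags from the command above.

```python
import subprocess
from pathlib import Path


def build_tts_command(text, model_name, out_path):
    """Create the output folder if needed and return the tts argument list."""
    out_file = Path(out_path)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    return [
        "tts",
        "--text", text,
        "--model_name", model_name,
        "--out_path", str(out_file),
    ]


def synthesize(text, model_name, out_path):
    """Run the tts command line tool; raises CalledProcessError if it fails."""
    subprocess.run(build_tts_command(text, model_name, out_path), check=True)


# Example call (needs the tts command from the activated environment):
# synthesize("Ahora puedo hablar en español!", "tts_models/es/css10/vits", "output/tts-es.wav")
```

Passing the arguments as a list avoids any quoting problems with accents or exclamation marks in the text.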

Voice cloning

Lastly, the most amazing feature of this model is voice cloning from only a few seconds of recorded audio.

Like in the previous post, I took some 30 seconds of Ultron's voice from the film Avengers: Age of Ultron.

Sample in Spanish:

Sample in English:

Now, let's prepare a python script to set all needed parameters, which will do the following:

Import torch and TTS
import torch
from TTS.api import TTS
Define the compute device (cuda or cpu). Using cpu should be enough (cuda might crash).
device="cpu"
Define text to be generated.
txt="Voice generated from text"
Define the reference audio sample (a .wav file of about 30 seconds)
sample="/voice-folder/voice.wav"
Call to TTS model
tts1=TTS("model_name").to(device)
File creation
tts1.tts_to_file(txt, speaker_wav=sample, language="es", file_path="output-folder/output-file.wav")

I saved the script as TRW-clone.py, and it looks like this:

import torch
from TTS.api import TTS

# Get device ('cuda' or 'cpu')
device="cpu"

#Define text
txt="Bienvenido a este nuevo artículo del blog. Disfruta de tu visita."
#txt="Welcome to this new blog post... Enjoy your visit!"

#Define audio sample
sample="../my-voices/ultron-es/mix.wav"
#sample="../my-voices/ultron-en/mix.wav"

#Run cloning
tts1 = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts1.tts_to_file(txt, speaker_wav=sample, language="es", file_path="../output/ultron-es.wav")

Run it from the TTS folder where the repo was installed:

cd TTS
python3 TRW-clone.py
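
Instead of commenting lines in and out for each language, the script can loop over both samples and load the model only once. This is a sketch of that idea, not a tested replacement: the function names are mine, and the English output path ../output/ultron-en.wav is an assumption, since the script above only writes the Spanish file.

```python
from pathlib import Path

# (language, text, reference sample, output file), taken from the script above.
# Assumption: the English output file name is made up for this example.
JOBS = [
    ("es", "Bienvenido a este nuevo artículo del blog. Disfruta de tu visita.",
     "../my-voices/ultron-es/mix.wav", "../output/ultron-es.wav"),
    ("en", "Welcome to this new blog post... Enjoy your visit!",
     "../my-voices/ultron-en/mix.wav", "../output/ultron-en.wav"),
]


def missing_samples(jobs):
    """Return the reference samples that do not exist on disk."""
    return [sample for _, _, sample, _ in jobs if not Path(sample).is_file()]


def run_jobs(jobs, device="cpu"):
    """Load XTTS v2 once and generate one file per job."""
    # Imported here so the checks above work even without TTS installed.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
    for language, text, sample, out_file in jobs:
        Path(out_file).parent.mkdir(parents=True, exist_ok=True)
        tts.tts_to_file(text, speaker_wav=sample, language=language, file_path=out_file)


# Example: check the samples first, then generate both languages.
# if not missing_samples(JOBS):
#     run_jobs(JOBS)
```

Loading the model is the slow part, so reusing it for both languages should save a good share of the total time.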

Results

Here are the results I got from my first tests.

Spanish:

English:

And with a couple of iterations you can get really amazing results.

If you have any questions or comments, you can drop me a line on Twitter/X

🐦 @RoamingWorkshop

🐢TorToiSe-TTS: AI text to speech generation

18/01/2023 / TRW

AI trends are here to stay, and today there’s much talk about Vall-E and its future combination with GPT-3. But we must remember that all these Artificial Intelligence products come from collaborative research that has been freely available, so we will always find an open-source equivalent, and we should be thankful for it.

That’s the case of TorToiSe-TTS (text to speech), an AI voice generator from written text that is totally free to use on your PC.

https://github.com/neonbjb/tortoise-tts

Just listen to a little sample below from Morgan Freeman:

Installation

The easiest way is to use the cloud notebook in Google Colab uploaded by the developer:

https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing

You just need to sign in and click "play" ▶ to run each block of code.

But surely you'll want to run TorToiSe-TTS locally, without internet, and save the audio to your local drive, so let's move on.

Installing python3

Like many AI applications, TorToiSe-TTS runs on Python, so you need python3 on your PC. I always recommend Linux, but you might be able to run it in a Windows terminal as well.

https://www.python.org/downloads/

On Linux, just install it from your distro repo:

sudo apt-get update
sudo apt-get install python3.11

You'll also need the venv module to create a virtual Python environment (it usually comes with the Windows installer):

sudo apt-get install python3.11-venv

Download repository

Download the official repository:

https://github.com/neonbjb/tortoise-tts

Either download it as a compressed file from that page, or clone it with git:

git clone https://github.com/neonbjb/tortoise-tts

You can also download my fork of the repository, where I add further installation instructions, an automatic terminal launcher and some test voices for Ultron (yes, Tony Stark's evil droid), which we'll see later on.

https://github.com/TheRoam/tortoise-tts-linuxLauncher

Create a python virtual environment

Next we'll have to install a series of Python modules needed to run TorToiSe-TTS, but before that we'll create a virtual Python environment, so the installation of these modules won't affect the rest of your Python installation. You'll find this helpful when you use several AI apps that need different versions of the same module.

Open a terminal and type the following to create a "TTS" environment:

cd tortoise-tts
python3 -m venv TTS

Activate it this way:

source TTS/bin/activate

And now you'll see a reference to the TTS environment in the terminal:

(TTS) abc@123:~$ |
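
The same environment can also be created from Python with the standard venv module, which is handy if you script the whole setup. A minimal sketch, where the function name and its defaults are my own choices:

```python
import venv
from pathlib import Path


def create_environment(project_dir, name="TTS", with_pip=True):
    """Create a virtual environment at project_dir/name and return its path."""
    env_dir = Path(project_dir) / name
    venv.create(env_dir, with_pip=with_pip)
    return env_dir


# Example, equivalent to "python3 -m venv TTS" inside the tortoise-tts folder:
# create_environment("tortoise-tts")
```

Activation still happens in the terminal with the source command shown above, because it changes the current shell session.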

Install python modules

Let's now install the required modules, following the Colab instructions:

https://colab.research.google.com/drive/1wVVqUPqwiDBUVeWWOUNglpGhU3hg_cbR?usp=sharing#scrollTo=Gen09NM4hONQ

pip3 install -U scipy
pip3 install transformers==4.19.0
pip3 install -r requirements.txt
python3 setup.py install

Now you can try running TorToiSe-TTS, but some libraries will fail, depending on your python installation:

python3 scripts/tortoise_tts.py "This is The Roaming Workshop" -v "daniel" -p "fast" -o "test1.wav"

Rerun the previous command, installing the missing modules each time, until you don't get any errors. In my case, they were the following:

pip3 install torch
pip3 install torchaudio
pip3 install llvmlite
pip3 install numpy==1.23

Finally, this test1.wav sounds like this in the voice of daniel (which turns out to be Daniel Craig):

Using TorToiSe-TTS

The simplest program in TorToiSe-TTS is scripts/tortoise_tts.py, and these are its main arguments:

python3 scripts/tortoise_tts.py "text" -v "voice" -V "path/to/voices/folder" --seed number -p "fast" -o "output-file.wav"

"text": text string that will be converted to audio
-v: voice used to read the text. It must be the name of one of the folders available in /tortoise/voices/
-V: specifies a folder for voices, in case you use a custom one.
--seed: seed number to initialize the algorithm (can be any number)
-p: preset mode that determines quality ("ultra_fast", "fast", "standard", "high_quality").
-o: path and name of the output file. You must specify the file format, which is .wav

If you use the TTS.sh script from my repo, you'll be prompted for these arguments on screen and it will run the algorithm automatically.
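
If you prefer Python over a shell launcher, the arguments above can be checked and assembled before calling the script. This is just a sketch of the idea, not how TTS.sh works: the function name is hypothetical and it only uses the flags and presets listed above.

```python
# Presets listed above for the -p argument.
PRESETS = ("ultra_fast", "fast", "standard", "high_quality")


def build_tortoise_command(text, voice, output, preset="fast", voices_dir=None, seed=None):
    """Validate the arguments and return the command as a list of strings."""
    if preset not in PRESETS:
        raise ValueError("preset must be one of: " + ", ".join(PRESETS))
    if not output.endswith(".wav"):
        raise ValueError("the output file must end in .wav")

    command = ["python3", "scripts/tortoise_tts.py", text, "-v", voice]
    if voices_dir is not None:
        command += ["-V", voices_dir]  # custom voices folder
    if seed is not None:
        command += ["--seed", str(seed)]
    command += ["-p", preset, "-o", output]
    return command


print(build_tortoise_command("This is The Roaming Workshop", "daniel", "test1.wav"))
```

The resulting list can be passed to subprocess.run(command, check=True) from the tortoise-tts folder, with the TTS environment active.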

Add your own voices

You can add more voices to TorToiSe-TTS. For example, I wanted to add the voice of Ultron, the Marvel supervillain, following the instructions from the developer, neonbjb:

You must record 3 "clean" samples (without background noise or music) of about 10 seconds each.
The format must be floating-point WAV with a 22050 Hz sample rate (you can use Audacity and export as 32-bit float)

Create a new folder inside /tortoise/voices (or anywhere, really) and save your recordings there.
When running TorToiSe-TTS, you'll need to call the voices folder with argument -V and the new voice with argument -v
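
Before running the model, a quick check that the new voice folder actually holds enough recordings can avoid a confusing error later. This is a minimal sketch with a hypothetical helper name; it only counts .wav files and does not inspect their sample rate or encoding.

```python
from pathlib import Path

MIN_SAMPLES = 3  # the instructions above ask for 3 clean recordings


def voice_samples(voices_dir, voice):
    """Return the sorted .wav files of a voice, or raise if the folder looks incomplete."""
    folder = Path(voices_dir) / voice
    if not folder.is_dir():
        raise FileNotFoundError("voice folder not found: " + str(folder))
    samples = sorted(folder.glob("*.wav"))
    if len(samples) < MIN_SAMPLES:
        raise ValueError(
            "found %d .wav files in %s, expected at least %d"
            % (len(samples), folder, MIN_SAMPLES)
        )
    return samples


# Example: voice_samples("./tortoise/voices", "ultron-en")
```

The two arguments are the same values you would pass to -V and -v.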

For example, to use my Ultron recordings:

python3 scripts/tortoise_tts.py "This is The Roaming Workshop" -V "./tortoise/voices" -v "ultron-en" -p "fast" --seed 17 -o "./wavs/TRW_ultron3.wav"

Which sounds like this:

I've taken cuts from a scene in Avengers: Age of Ultron, both in English (ultron-en) and Spanish (ultron-es), which you can download from my repository.

"Ultron supervillain" by Optimised Stable Diffusion + Upscayl 2. The Roaming Workshop 2023.

Right now, the TorToiSe-TTS model is trained only in English, so it only works properly in this language.

You can enter text in another language and the AI will try to read it, but it will use the pronunciation it learnt in English and it will sound weird.

If you are willing to train a model in your language (you need plenty of GPU power and several months), contact the developer, as he's keen to expand the project.

In the meantime, you can send any questions or comments on 🐦 Twitter or 🐤 Koo!

🐦 @RoamingWorkshop

🐤 @TheRoamingWorkshop
