The production of artificial, natural-sounding human speech is a fascinating topic due to its complexity and surprising results, with applications that range from chatbots to the automation of audio content in news media. One obvious example of a Text-to-Speech (TTS) application for news media is a Read Aloud feature for a news article that makes it possible to listen while commuting. In fact, this feature is already offered by some newspapers, though they mainly use external TTS services to power their applications.

As Europe’s largest digital publisher, we decided to go one step further by building up our own technology in-house and creating a custom brand voice instead of relying on third-party providers. Out of this work, we have released two open-source projects: ForwardTacotron and TransformerTTS.

Here, and then in more depth and interactively in the workshop at ODSC East 2021, we will look at the steps needed to train a Transformer-based TTS neural network, relying on the TransformerTTS repository.

Go ahead and listen to some of our samples!

ForwardTacotron

TransformerTTS

If you are new to the topic of speech synthesis, this will serve as an introduction to the general concepts of the workshop session, in which we will go into all the interesting details and play around with the models.

If you are already familiar with the topic of speech synthesis, this will be useful to refresh some of the ideas and get to know some of the details of the repository that we will use during the workshop.

Speech Synthesis

Modern speech synthesis is a multi-step problem in which multiple neural networks are trained and deployed to convert raw text into a natural-sounding voice. In one of the best approaches, introduced by Microsoft in their 2019 FastSpeech paper, this process is divided into three steps:

– aligning text and audio using an autoregressive model
– converting text to spectrogram using a feed-forward TTS model
– reconstructing natural sounding voice from spectrograms with a neural vocoder model, like MelGAN or WaveRNN

Speech can also be reconstructed from the TTS model outputs using the Griffin-Lim algorithm, although the generated voice will not sound as natural.

Here we will focus on the first two steps, which dictate the prosody and expressiveness of the produced speech by doing the actual text-to-speech translation.

Let’s dive in.

Setup

The Data

As with most machine learning problems, the first step is gathering the data. In this case, training data is made of wav files containing the voice you want to reproduce and the corresponding transcriptions.

Like most interesting things in life, there is a bright side and a dark side.

On the dark side lies the fact that, to obtain reasonable results, you will need to collect at the very least about 7–8 hours of very clean and diverse data (common datasets used in research contain about 20 hours of clean text-voice data).

On the bright side, you will not need the sound files and the text to be aligned with complicated timestamps and pre-computed text features: raw voice and text will be enough, as long as they do represent each other.

So make sure that the transcriptions are correct and consistent!
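
To check how close a collection of recordings is to the 7–8 hour mark mentioned above, the durations of the wav files can simply be summed. The snippet below is a minimal sketch using only Python's standard library. The function names are ours and are not part of the TransformerTTS repository, and the sketch assumes uncompressed PCM wav files sitting directly in one folder.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of a single PCM wav file in seconds."""
    with wave.open(str(path), 'rb') as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def dataset_hours(wav_directory):
    """Total duration, in hours, of all *.wav files directly inside wav_directory."""
    wav_paths = sorted(Path(wav_directory).glob('*.wav'))
    return sum(wav_seconds(path) for path in wav_paths) / 3600.0


# Example (the path is a placeholder for your own data folder):
# print(f"{dataset_hours('/data/datasets/MyVoice/'):.1f} hours of audio")
```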

Create a csv file containing all your transcriptions: if you have a voice file called my_voice_000001.wav saying “I am a wonderful being.”, your metadata.csv file should contain the line

my_voice_000001.wav|I am a wonderful being.
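
Since the quality of the transcriptions matters so much, it can be worth scanning the metadata file for obvious problems before training. The following is a small sketch of such a check. The helper `find_metadata_problems` is hypothetical (it is not part of the repository), and it only assumes the `filename|text` layout shown above.

```python
def find_metadata_problems(lines, column_sep='|'):
    """Return (line_number, message) pairs for lines that look malformed."""
    problems = []
    seen_files = set()
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip('\n').split(column_sep)
        if len(fields) < 2:
            problems.append((number, 'missing column separator'))
            continue
        # First column is the file name, last column is the transcription.
        filename, text = fields[0].strip(), fields[-1].strip()
        if not filename:
            problems.append((number, 'empty file name'))
        if not text:
            problems.append((number, 'empty transcription'))
        if filename in seen_files:
            problems.append((number, 'duplicate file name'))
        seen_files.add(filename)
    return problems


example = [
    'my_voice_000001.wav|I am a wonderful being.',
    'my_voice_000002.wav|',
    'my_voice_000003.wav I forgot the separator.',
]
print(find_metadata_problems(example))
# [(2, 'empty transcription'), (3, 'missing column separator')]
```

A check like this only catches formatting slips. Whether the text actually matches what is said in the recording still needs a human ear.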

There is actually more good news, in case you don’t have the data and don’t want to spend hours in front of a professional microphone: there is some well-curated open-source data out there for you to train on, one of the most commonly used datasets being LJSpeech.

Go ahead and get it from https://keithito.com/LJ-Speech-Dataset/. This dataset already contains a wav folder and a metadata file in the format we like.

Now that we have the data ready, let’s jump into the next step:

The Code for Speech Synthesis

Clone the Speech Synthesis repository and follow the installation instructions.

git clone https://github.com/as-ideas/TransformerTTS.git

Preprocessing: the what

The first thing you want to do after getting the data is to prepare it for your models.

There are two preprocessing steps, one for text and one for audio.

Due to the discrepancy between how things are written and how they are actually spoken out loud, it is convenient to convert the text into phonemes: these are perceptually distinct units of sound that distinguish one word from another in a particular language. For instance, the words fake news are represented as ˈfɛɪ̯kˌnjuːs. You can check https://en.wiktionary.org for more of these.

This representation has two major advantages. First of all, it makes the speech-to-text mapping injective, as there will be only one way of transcribing the voice with phonemes. This greatly simplifies the text-to-speech task, although it remains an ill-defined problem, as there are many ways any word can be spoken: pitch, rhythm and volume are some of the features of speech that are not represented by phonemes. After all, there is a reason why we summon those big neural networks for the task!

A second advantage is control and flexibility: if your model learns to pronounce all the phonemes correctly it could, in principle, pronounce any word of any language. You will see this is not really the case, as we will need to present the model with combinations of these phonemes for it to sound natural.
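
To make the first advantage concrete, here is a toy sketch in which differently spelled words collapse onto the same phoneme string. The tiny lexicon is hand-written for illustration only and is purely hypothetical. A real pipeline would rely on a dedicated phonemizer tool rather than a lookup table.

```python
# Hand-written toy lexicon (illustrative assumption, not taken from the repository).
TOY_LEXICON = {
    'the': 'ðə',
    'night': 'naɪt',
    'knight': 'naɪt',
    'right': 'ɹaɪt',
    'write': 'ɹaɪt',
}


def toy_phonemize(text, lexicon=TOY_LEXICON):
    """Map each known word to its phoneme string; unknown words raise KeyError."""
    return ' '.join(lexicon[word] for word in text.lower().split())


# Two different spellings, one pronunciation: the model only ever sees the latter.
print(toy_phonemize('the knight'))
print(toy_phonemize('the night'))
```

Both calls print the same phoneme sequence, so the network never has to learn that a silent "k" changes nothing about the sound.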

The next step is preparing the audio. To simplify the problem again and work in the neat space of Fourier transforms, which greatly simplify operations on waveforms, we convert our audio snippets into mel-spectrograms (“mels” for short): these are power spectra obtained from a windowed short-time Fourier transform of the audio, with the frequencies mapped onto the mel scale and the magnitudes scaled logarithmically.

Mel-Spectrogram of an utterance

Basically now we deal with compressed images instead of lengthy audio time-series, a computationally much cheaper task.
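
To give a feeling for the numbers involved, the sketch below implements the widely used mel-scale formula, mel = 2595 · log10(1 + f / 700), and counts how many spectrogram frames a clip produces. The sample rate (22050 Hz), hop length (256 samples) and number of mel bands (80) are common choices that we assume here for illustration. They are not necessarily the values in the repository's configuration.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_mels)]


def n_frames(n_samples, hop_length):
    """Frame count when the signal is padded so that frames are centered."""
    return 1 + n_samples // hop_length


# Assumed settings: 5 seconds of audio at 22050 Hz, hop of 256 samples, 80 mel bands.
samples = 5 * 22050
frames = n_frames(samples, hop_length=256)
print(samples, 'samples ->', frames, 'frames of 80 mel values')
# 110250 samples -> 431 frames of 80 mel values
```

The time axis shrinks by a factor equal to the hop length, and the mel bands are packed more densely at low frequencies, where human hearing is more sensitive.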

Preprocessing: the how

In order to actually do all of this, let us jump back to our repo and set up the yaml configuration files. There isn’t much you need to do, as the standard settings will work for most use cases.

Open up the session_paths.yaml file and edit it. Assuming you have the wav and metadata files under the directory /data/datasets/MyVoice/ and that you want to keep the training logs and weight files under /data/logs/tts/, the file should look like this:

wav_directory: '/data/datasets/MyVoice/'
metadata_path: '/data/datasets/MyVoice/metadata.csv'
log_directory: '/data/logs/tts/'
train_data_directory: 'transformer_tts_data'
data_config: './data_config.yaml'
aligner_config: './aligner_config.yaml'
tts_config: './tts_config.yaml'
data_name: ljspeech # raw data naming for default data reader (select function from data/metadata_readers.py)

Under data_name you can specify a name for your data. Should you change it, make sure to add an appropriate metadata reader to the python file data/metadata_readers.py. For instance, for data called my_voice, that has the metadata you actually want to use on the third column of the csv instead of the second:

def my_voice(metadata_path: str, column_sep='|') -> dict:
    # Maps each file name (without extension) to the text in the third csv column
    text_dict = {}
    with open(metadata_path, 'r', encoding='utf-8') as f:
        for line in f:
            line_split = line.split(column_sep)
            filename, text = line_split[0], line_split[2]
            if filename.endswith('.wav'):
                filename = filename.split('.')[0]
            # Strip the trailing newline from the last column
            text = text.replace('\n', '')
            text_dict.update({filename: text})
    return text_dict

Let us leave the other options unchanged for now.

Note that if you are using the LJSpeech dataset the standard options will allow you to use a pre-trained vocoder model from any of these great open source repositories: MelGAN and HiFi-GAN.

All set, now from the root project folder run

python create_training_data.py --config configs/session_paths.yaml

Training

Now that we have carried out all the necessary preprocessing steps, let us train some models, shall we?

Aligning the data: the what

As mentioned earlier, we do not need text-audio alignment information: we can create it ourselves. The first step will in fact be to train an aligner model that will compute the duration of each phoneme and of each punctuation mark present in the text.

This is a very delicate step in the pipeline, as it will greatly impact the results of our final model.

To extract these alignments we use an autoregressive model that takes as inputs both the phonemized metadata and the corresponding target mel: given the whole text input sequence and the first n mel timesteps, the model will predict the (n+1)th spectrogram timestep.

Autoregression mechanism. Credit https://deepmind.com/blog/article/wavenet-generative-model-raw-audio

A major trick and caveat here is that we can and want to use the target mels to run the autoregression in parallel in what is called a “teacher forcing” process: instead of using its own first n predictions to predict the next (as a normal autoregressive model would do), we use the first n targets as input.
This, together with a causal mask (an upper triangular mask) which prevents the model from using the information from the future timesteps, allows us to train all the timesteps in parallel.
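
The two ingredients just described can be sketched in a few lines of plain Python. This is only an illustration of the idea, with names of our own choosing. In the actual model the frames are mel vectors and the mask is applied inside the attention layers.

```python
def causal_mask(n_steps):
    """mask[i][j] == 1 where step i must NOT look at step j (that is, j > i)."""
    return [[1 if j > i else 0 for j in range(n_steps)] for i in range(n_steps)]


def teacher_forcing_pairs(target_frames, start_frame):
    """Shift the targets right by one: the input at step n is the true frame n-1."""
    decoder_inputs = [start_frame] + target_frames[:-1]
    return decoder_inputs, target_frames


# Frames are represented by labels here; 'GO' is a hypothetical start-of-sequence frame.
inputs, targets = teacher_forcing_pairs(['mel_1', 'mel_2', 'mel_3'], 'GO')
print(inputs)   # ['GO', 'mel_1', 'mel_2']
print(targets)  # ['mel_1', 'mel_2', 'mel_3']

for row in causal_mask(3):
    print(row)
# [0, 1, 1]
# [0, 0, 1]
# [0, 0, 0]
```

Because every input position already holds the ground-truth previous frame, and the upper-triangular mask hides everything to its right, all positions can be predicted in a single forward pass.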

Interestingly, we will use teacher forcing during prediction as well: all we need are those sweet, sweet alignments which, with teacher forcing, are easier to compute. No need to bother doing actual (hard) autoregressive predictions!

Aligning the data: the how

To train the aligner model, simply run from the root project folder

python train_aligner.py --config configs/session_paths.yaml

During this phase the transformer model will build an attention mapping between the input and the target, which we will then use to extract the alignment information, or phoneme durations, that we will need for our final TTS model.

Attention mapping: the horizontal and vertical axes correspond to mel-timesteps and phonemes respectively

There are a lot of nitty-gritty details that go into this phase, as well as clear indicators of whether your training session is succeeding. All of this can be easily monitored through the TensorBoard session, so make sure to tune in for our workshop to get all the details and a thorough explanation of this step.

Get them durations

Now that we trained the aligner, one of the hardest steps is behind us. All we’re left to do is to extract the information we need from the aligner model and train the TTS model.

Given that the attention maps are often far from perfect and contain uncertainty, as you can probably grasp from the above picture, we use a graph algorithm to make sure that the durations we extract are solid: the famous Dijkstra algorithm for shortest paths.

We will dig deeper into this process during the workshop; for now it’s enough to know that this step helps us exclude uncertainty and ignore unlikely jumps that occur in the raw attention maps. In the picture below you can see the result: a much cleaner extracted alignment map on top of the actual attention maps.

Cleaned attention overlapped on raw attention.
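
The idea of cleaning an attention map with a shortest-path search can be sketched as follows. This is a minimal illustration and not the repository's implementation. The graph construction is our own assumption: each cell (mel step, phoneme) is a node, a step either stays on the current phoneme or advances to the next one, and the cost of entering a cell is the negative log of its attention weight.

```python
import heapq
import math


def extract_alignment_path(attention):
    """attention[t][p] in [0, 1] is the weight of mel step t on phoneme p.

    Returns the phoneme index chosen for every mel step along the cheapest
    monotonic path from the first to the last phoneme (needs steps >= phonemes).
    """
    n_steps, n_phonemes = len(attention), len(attention[0])

    def cost(t, p):
        # Non-negative because weights are at most 1, as Dijkstra requires.
        return -math.log(max(attention[t][p], 1e-8))

    start, goal = (0, 0), (n_steps - 1, n_phonemes - 1)
    best = {start: cost(0, 0)}
    parent = {}
    queue = [(best[start], start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            break
        if dist > best[node]:
            continue  # stale queue entry
        t, p = node
        for nxt in ((t + 1, p), (t + 1, p + 1)):  # stay on phoneme or advance
            if nxt[0] >= n_steps or nxt[1] >= n_phonemes:
                continue
            new_dist = dist + cost(*nxt)
            if new_dist < best.get(nxt, math.inf):
                best[nxt] = new_dist
                parent[nxt] = node
                heapq.heappush(queue, (new_dist, nxt))

    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    path.reverse()
    return [p for _, p in path]


# Toy attention map: the fourth mel step "jumps back" to phoneme 0 if read naively.
attention = [
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.1, 0.4],
    [0.0, 0.1, 0.9],
]
print([row.index(max(row)) for row in attention])  # [0, 0, 1, 0, 2]
print(extract_alignment_path(attention))           # [0, 0, 1, 2, 2]
```

Taking the strongest weight of each row independently yields an impossible backward jump, while the shortest path is forced to move through the phonemes in order.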

Now that we have cleaned the attention maps, we can finally simply count how many mel timesteps correspond to each phoneme to get our phoneme-durations and proceed to the next step!
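
Counting the timesteps is the easy part. As a small sketch, assume the cleaned alignment is available as a list holding, for every mel timestep, the index of the phoneme it belongs to (this data layout and the function name are our own illustrative choices):

```python
from collections import Counter


def durations_from_path(path, n_phonemes):
    """Number of mel timesteps assigned to each phoneme index."""
    counts = Counter(path)
    return [counts.get(index, 0) for index in range(n_phonemes)]


# Five mel timesteps aligned to three phonemes.
path = [0, 0, 1, 2, 2]
durations = durations_from_path(path, n_phonemes=3)
print(durations)  # [2, 1, 2]
```

The durations always sum to the number of mel timesteps, which is a handy sanity check on the extracted data.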

Again, no need to code any of this yourself, simply run

python extract_durations.py --config configs/session_paths.yaml

to get the last piece of the puzzle.

Train the TTS model

Finally

python train_tts.py --config configs/session_paths.yaml

will trigger the last operation of the pipeline.

With this step we will train a FastPitch-like model, that will predict on the fly the duration for each phoneme (using the data extracted with the previous step) as well as the pitch profile of the target audio file.

Forward TTS architecture, from FastPitch. FFT is a Feed-Forward Transformer layer
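
The way phoneme durations are used inside a feed-forward TTS model can be illustrated with the length regulator introduced in FastSpeech: each phoneme encoding is repeated for as many mel timesteps as its duration. The sketch below uses plain Python lists and labels instead of tensors, and the `speed` argument is a hypothetical knob added for illustration.

```python
def length_regulate(phoneme_encodings, durations, speed=1.0):
    """Repeat each encoding `duration / speed` times (rounded) along the time axis."""
    expanded = []
    for encoding, duration in zip(phoneme_encodings, durations):
        n_repeats = max(0, round(duration / speed))
        expanded.extend([encoding] * n_repeats)
    return expanded


encodings = ['h', 'i', '!']   # stand-ins for phoneme encoding vectors
durations = [2, 3, 1]         # mel timesteps per phoneme

print(length_regulate(encodings, durations))
# ['h', 'h', 'i', 'i', 'i', '!']
print(len(length_regulate(encodings, durations, speed=0.5)))
# 12
```

Halving the speed doubles every duration, so the same sentence is stretched over twice as many mel timesteps without touching the rest of the network.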

There are some further choices that go into this training step, such as minor architecture details as well as quite a few knobs, levers and indicators to keep under control, all of which you can access from the TensorBoard session.

Once training is done, using the model will be as easy as

from utils.config_manager import Config
from data.audio import Audio

config_loader = Config(config_path='config/session_paths.yaml')
audio = Audio(config_loader.config)
model = config_loader.load_model()
out = model.predict("I don't speak often, but when I do, it's for ODSC East.")

# Convert spectrogram to wav (with griffin lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
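
To listen to the result, the waveform still has to be written to disk. One way to do this without extra dependencies is sketched below, assuming the waveform is a sequence of floats in the range [-1, 1] and that the sample rate is 22050 Hz (the value must match your data configuration). The helper `save_wav` is our own and not part of the repository.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, path, sample_rate=22050):
    """Write float samples in [-1, 1] as a mono 16-bit PCM wav file."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # avoid integer overflow
        pcm += struct.pack('<h', int(round(clipped * 32767)))
    with wave.open(str(path), 'wb') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))


# Demo with a synthetic 440 Hz tone standing in for the model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
save_wav(tone, demo_path)
```

With the snippet above, `save_wav(wav, 'sample.wav')` would store the Griffin-Lim reconstruction next to your script.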

Having control over pitch and speed/duration allows us to greatly control the prosody of the synthesized voice and, most importantly, enhance our results with greater control during training.

If you wish to know more about the remaining details, take a thorough look at a full training session (over 3 days of training), play around with some models, or maybe just chat about the topic, please join our workshop session at ODSC East 2021, “Brand Voice: Deep Learning for Speech Synthesis,” on April 1st at 11 AM EST.

Hope to see you there!


About the author on Deep Learning for Speech Synthesis/ODSC East 2021 speaker:

Francesco Cardinale received his MS degree in Computational Mathematics in 2017 from the Department of Mathematics at the Technical University of Berlin. After a research experience in Bayesian machine learning, he pivoted into deep learning with generative models and computer vision. In his current position as a machine learning research engineer, he works on NLP and speech synthesis and has presented and authored a few open source projects related to these topics.