Abstract: As in many other fields, text-to-speech (TTS) has reached a new level with the recent advances in deep learning. TTS is a sequence-to-sequence (seq2seq) problem riddled with peculiarities and specific challenges. A common modern solution is a three-step approach: first, a sequence-to-sequence model aligns text and audio; second, a feed-forward network predicts spectrograms from the text input; and last, a vocoder model synthesizes the final waveform from the predicted spectrograms.
The challenges of this problem include a high-dimensional target space, a non-bijective correspondence between text and spectrograms, a large discrepancy between input and target sequence lengths, the need for a student-teacher approach to avoid autoregression, very long output sequences, slow inference, long training times (over 2 weeks), and the lack of an explicit evaluation metric that correlates with perceived audio quality.
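To make the hand-off from the alignment model to the feed-forward model more concrete, here is a minimal sketch in Python with NumPy. It assumes the aligner exposes an attention matrix of shape (spectrogram frames, text tokens) and uses a simple per-frame argmax to turn it into token durations; the function names `durations_from_attention` and `length_regulate` are hypothetical and chosen for illustration only, and this is not necessarily how the project's own models extract durations.

```python
import numpy as np


def durations_from_attention(attention):
    """Count how many spectrogram frames attend most strongly to each token.

    attention: array of shape (n_frames, n_tokens), e.g. attention weights
    taken from a trained autoregressive (teacher) model. This argmax rule is
    an illustrative assumption, not a method confirmed by the abstract.
    """
    attention = np.asarray(attention)
    n_tokens = attention.shape[1]
    best_token = attention.argmax(axis=1)  # most-attended token per frame
    return np.bincount(best_token, minlength=n_tokens)


def length_regulate(token_encodings, durations):
    """Repeat each token encoding by its duration to match the frame count."""
    return np.repeat(token_encodings, durations, axis=0)


# Toy example: 3 text tokens aligned to 5 spectrogram frames.
attention = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.0, 0.1, 0.9],
])
token_encodings = np.arange(12, dtype=float).reshape(3, 4)  # 3 tokens, 4 features

durations = durations_from_attention(attention)
expanded = length_regulate(token_encodings, durations)
```

With durations like these, a feed-forward (student) model can expand a short text sequence to the much longer spectrogram length in a single pass, instead of generating frames one at a time.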
During the course of this project, we, together with other teams (e.g. Mozilla), have tackled many of these issues. We successfully trained current state-of-the-art architectures such as Tacotron and Transformer-based models, developed their feed-forward counterparts, and released all of it as open source.
Using these models, we created the first brand voice for Axel Springer, which now allows for audio content on the news website.
Bio: Francesco received his MS degree in Computational Mathematics in 2017 from the Department of Mathematics at the Technical University of Berlin. After gaining research experience in Bayesian machine learning, he pivoted to deep learning with generative models and computer vision. In his current position as a machine learning research engineer, he works on NLP and speech synthesis, and he has authored and presented a few open-source projects related to these topics.