Brand Voice: Deep Learning for Speech Synthesis

Abstract: 

As in many other fields, text-to-speech (TTS) has reached a new level with recent advances in deep learning. TTS is a sequence-to-sequence (seq2seq) problem with its own peculiarities and challenges. The modern solution is a three-step approach: first, a sequence-to-sequence model aligns text and audio; second, a feed-forward network predicts spectrograms from the text input; and last, a vocoder model synthesizes the final waveform from the predicted spectrograms.
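One way to picture how the first two steps can connect is the toy sketch below. It is an illustration under stated assumptions, not the project's code: it assumes the alignment from the first step is an attention matrix (one row per spectrogram frame, one column per input token), that a per-token duration is taken as the number of frames attending most strongly to that token, and that the feed-forward model repeats each token by its duration to reach the output length. The function names `durations_from_alignment` and `expand_by_duration` are hypothetical.

```python
from typing import List, Sequence


def durations_from_alignment(alignment: Sequence[Sequence[float]]) -> List[int]:
    """Count, for each input token, the frames whose strongest weight is on it.

    alignment[t][i] is the (assumed) attention weight of output frame t
    on input token i.
    """
    n_tokens = len(alignment[0])
    durations = [0] * n_tokens
    for frame_weights in alignment:
        best_token = max(range(n_tokens), key=lambda i: frame_weights[i])
        durations[best_token] += 1
    return durations


def expand_by_duration(tokens: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each token so the result has one entry per spectrogram frame."""
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    expanded: List[str] = []
    for token, duration in zip(tokens, durations):
        expanded.extend([token] * duration)
    return expanded


# Toy alignment: 6 spectrogram frames over 3 input tokens.
alignment = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
tokens = ["h", "a", "i"]

durations = durations_from_alignment(alignment)
frames = expand_by_duration(tokens, durations)
print(durations, frames)
```

Even in this toy case the output is twice as long as the input; real utterances have far more frames per character, which is the length discrepancy between input and target sequences.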

The problem poses several challenges: a high-dimensional target space; a non-bijective correspondence between text and spectrograms; a large discrepancy between input and target sequence lengths; the need for a student-teacher approach to avoid autoregression; very long output sequences; slow inference; long training times (over 2 weeks); and the lack of an explicit evaluation metric that correlates with perceived audio quality.

During this project we, together with other teams (e.g. Mozilla), tackled many of these issues. We successfully trained current state-of-the-art architectures such as Tacotron and Transformer-based models, developed their feed-forward counterparts, and made all of it available as open source.
Using these models, we created the first brand voice for Axel Springer, which now enables audio content on the news website.

Bio: 

Francesco received his MS degree in Computational Mathematics in 2017 from the Department of Mathematics at the Technical University of Berlin. After research experience in Bayesian machine learning, he pivoted to deep learning, working on generative models and computer vision. In his current position as a machine learning research engineer, he works on NLP and speech synthesis and has authored and presented several open-source projects related to these topics.

Open Data Science

