Deep learning-based speech synthesis for reading aloud lengthy and information-rich texts in Swedish
Reference number | |
Coordinator | Kungliga Tekniska Högskolan - Språkbanken Tal |
Funding from Vinnova | SEK 6 617 200 |
Project duration | October 2019 - October 2024 |
Status | Completed |
Venture | AI - Leading and innovation |
Call | From AI-research to innovation |
Important results from the project
The project began when neural speech synthesis was relatively new and Swedish neural synthesis systems were scarce. The goal of the project was to develop Swedish neural speech synthesis capable of handling long and information-rich text without making more errors than traditional synthesis. This goal was achieved. The sub-goals were to make Swedish adaptations to the processing (both training and synthesis), test them, and make them available. These goals were also achieved, and tools and resources are now being prepared for release on the research infrastructure Språkbanken Tal.
Expected long term effects
The project contributes resources that support Swedish speech synthesis, e.g. the pronunciation lexicon Braxen, the text preprocessor Sardin, adaptations of the training system Matcha, a Swedish test set for preprocessing and synthesis of long and information-rich text, and improved evaluation of this type of speech synthesis. The project has also contributed to collaborations in research and industry. The work on evaluation has gained international attention, and the project group is arranging one of the prestigious Dagstuhl seminars on the topic in January.
Approach and implementation
The rapid development during the project period has been exciting and complicated to follow. One obstacle we thought we would have to tackle, the poor quality of the step that goes from a two-dimensional representation of sound to Swedish sound, solved itself during the project: that process now works well in general, for all languages. Other obstacles proved greater than expected. We spent an unexpected amount of resources on legal issues, where one result is that we and MTM managed to free up several resources for general use, and another is that we were forced to cancel the recordings of a new voice.