Self-attention Linguistic-acoustic Decoder
2018 · Santiago Pascual, Antonio Bonafonte, Joan Serrà
Abstract
The conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models like recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure tends to make them slow to train and to sample from. In this work, we try to overcome the limitations of recursive structure by using a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder network is competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU inference time. On average, it increases Mel cepstral distortion between 0.1 and 0.3 dB, but it is over an order of magnitude faster on average. Fast inference is important for the deployment of speech synthes
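The abstract reports quality as Mel cepstral distortion (MCD) in dB. As a rough illustration of what a 0.1 to 0.3 dB difference is measured on, the sketch below computes MCD with the commonly used formula (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. This is not taken from the paper: the function name, the assumption that both sequences are already time-aligned, and the assumption that the energy coefficient has been dropped beforehand are all illustrative choices.

```python
import math


def mel_cepstral_distortion(ref_frames, gen_frames):
    """Mean MCD in dB between two sequences of mel-cepstral vectors.

    Assumptions (not from the paper): the sequences are time-aligned and
    equally long, and the caller has already removed the energy (0th)
    coefficient if it should be excluded.
    """
    if not ref_frames or len(ref_frames) != len(gen_frames):
        raise ValueError("sequences must be non-empty and of equal length")

    # Constant factor of the standard MCD definition: (10 / ln 10) * sqrt(2).
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, gen in zip(ref_frames, gen_frames):
        if len(ref) != len(gen):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((r - g) ** 2 for r, g in zip(ref, gen))
        total += scale * math.sqrt(squared_error)

    # Average the per-frame distortion over the utterance.
    return total / len(ref_frames)


if __name__ == "__main__":
    # Toy two-frame example with made-up coefficient values.
    reference = [[0.50, -0.20, 0.10], [0.40, -0.10, 0.00]]
    generated = [[0.48, -0.22, 0.12], [0.41, -0.08, 0.01]]
    print(f"MCD: {mel_cepstral_distortion(reference, generated):.3f} dB")
```

Lower values mean the generated cepstra are closer to the reference, so the reported increase of 0.1 to 0.3 dB corresponds to a small loss in accuracy relative to the recurrent baseline.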
Related papers
- Neural Speech Synthesis with Transformer Network (2018)
- FastSpeech: Fast, Robust and Controllable Text to Speech (2019)
- Transformer-Transducer: End-to-End Speech Recognition with Self-Attention (2019)
- Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory (2020)
- Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis (2024)
- Effective Decoder Masking for Transformer Based End-to-End Speech Recognition (2020)
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition (2020)
- Attentional Speech Recognition Models Misbehave on Out-of-Domain Utterances (2020)