On The Use Of Self-supervised Speech Representations In Spontaneous Speech Synthesis
2023 · Siyang Wang, Gustav Eje Henter, Joakim Gustafson, et al.
Abstract
Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications. Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech. However, it is still not clear which SSL model, and which layer within each model, is best suited for spontaneous TTS. We address this shortcoming by extending the scope of comparison for SSL in spontaneous TTS to six different SSL models and three layers within each. Furthermore, SSL has also shown potential in predicting the mean opinion scores (MOS) of synthesized speech, but this has only been done in read-speech MOS prediction. We extend an SSL-based MOS prediction framework previously developed for scoring read speech synthesis and evaluate its performance on synthesized spontaneous speech. All experiments are conducted twice on two different spontaneou
Related papers
- Analytic Study Of Text-free Speech Synthesis For Raw Audio Using A Self-supervised Learning Model (2024)
- Analyzing The Factors Affecting Usefulness Of Self-supervised Pre-trained Representations For Speech Recognition (2022)
- Low-resource Self-supervised Learning With SSL-enhanced TTS (2023)
- LE-SSL-MOS: Self-supervised Learning MOS Prediction With Listener Enhancement (2023)
- Investigating Self-supervised Learning For Speech Enhancement And Separation (2022)
- SALTTS: Leveraging Self-supervised Speech Representations For Improved Text-to-speech Synthesis (2023)
- What Do Self-supervised Speech And Speaker Models Learn? New Findings From A Cross Model Layer-wise Analysis (2024)
- LeBenchmark: A Reproducible Framework For Assessing Self-supervised Representation Learning From Speech (2021)