WavThruVec: Latent Speech Representation As Intermediate Features For Neural Speech Synthesis
2022 · Hubert Siuzdak, Piotr Dura, Pol van Rijn, et al.
Abstract
Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited because they do not allow the full potential of a data-driven approach to be exploited through learning hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. That allows us to utilize annotated speech datasets of a lower quality to train the first-stage module. At the same time, the second-stage component can be trained on large-scale untran
Related papers
- Vec2wav 2.0: Advancing Voice Conversion Via Discrete Token Vocoders (2024)
- Wav2vec: Unsupervised Pre-training For Speech Recognition (2019)
- Wav2vec 2.0: A Framework For Self-supervised Learning Of Speech Representations (2020)
- Av2wav: Diffusion-based Re-synthesis From Continuous Self-supervised Features For Audio-visual Speech Enhancement (2023)
- Wasserstein GAN And Waveform Loss-based Acoustic Model Training For Multi-speaker Text-to-speech Synthesis Systems Using A Wavenet Vocoder (2018)
- Unsupervised Speech Representation Learning Using Wavenet Autoencoders (2019)
- Utilizing Neural Transducers For Two-stage Text-to-speech Via Semantic Token Prediction (2024)
- Speech2vec: A Sequence-to-sequence Framework For Learning Word Embeddings From Speech (2018)