An Experimental Study: Assessing The Combined Framework Of WavLM And BEST-RQ For Text-to-speech Synthesis
2023 · Via Nielson, Steven Hillis
Abstract
We propose a new model architecture intended for text-to-speech (TTS). It combines WavLM, a pre-trained self-supervised learning (SSL) speech model, with the BEST-RQ vector quantization framework. We assess whether the more task-agnostic WavLM, together with the simpler BEST-RQ framework and its suitability for a wider array of downstream tasks, yields favorable outcomes. Experiments on the LibriSpeech dataset with SUPERB benchmarking show that the proposed model significantly underperforms. We speculate that this is related to the difference between featurizing raw audio waveforms and featurizing spectrograms with a quantizer. We discuss the limitations of this approach to better guide future advancements in TTS.
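For readers unfamiliar with BEST-RQ, its core idea is a random-projection quantizer: input frames are projected by a frozen random matrix, and the index of the nearest entry in a frozen random codebook serves as a discrete label. The following is a minimal sketch of that idea only, not the authors' implementation. The class name, dimensions, seeds, and plain Gaussian initialization are assumptions chosen for illustration, and the toy frames stand in for whatever features are actually fed to the quantizer.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class RandomProjectionQuantizer:
    """Hypothetical helper: frozen random projection plus frozen random codebook."""

    def __init__(self, input_dim, code_dim, codebook_size, seed=0):
        rng = random.Random(seed)
        # Projection matrix of shape (code_dim, input_dim), fixed at init.
        self.projection = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                           for _ in range(code_dim)]
        # Codebook of unit-norm random vectors, also fixed at init.
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(codebook_size)
        ]

    def quantize(self, frame):
        """Return the index of the codebook entry nearest to the projected frame."""
        projected = l2_normalize(
            [sum(w * x for w, x in zip(row, frame)) for row in self.projection]
        )
        # For unit vectors, the smallest L2 distance is the largest dot product.
        scores = [sum(c * p for c, p in zip(code, projected))
                  for code in self.codebook]
        return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: 5 random 16-dimensional frames stand in for real input features.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(5)]
quantizer = RandomProjectionQuantizer(input_dim=16, code_dim=8, codebook_size=32)
labels = [quantizer.quantize(f) for f in frames]
```

Because neither the projection nor the codebook is trained, the labels depend entirely on the statistics of the input features, which is one reason the choice between waveform-derived and spectrogram-derived inputs could matter.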
Related papers
- QS-TTS: Towards Semi-supervised Text-to-speech Synthesis Via Vector-quantized Self-supervised Speech Representation Learning (2023) · 2.26
- VQTTS: High-fidelity Text-to-speech Synthesis With Self-supervised VQ Acoustic Feature (2022) · 11.85
- Spark-TTS: An Efficient LLM-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025) · 8.08
- End-to-end Text-to-speech Using Latent Duration Based On VQ-VAE (2020) · 6.77
- Improving Audio Codec-based Zero-shot Text-to-speech Synthesis With Multi-modal Context And Large Language Model (2024) · 2.26
- DQR-TTS: Semi-supervised Text-to-speech Synthesis With Dynamic Quantized Representation (2023) · 2.26
- Evaluating Text-to-speech Synthesis From A Large Discrete Token-based Speech Language Model (2024) · 0.00
- wav2vec 2.0: A Framework For Self-supervised Learning Of Speech Representations (2020) · 0.00