UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
2025 · Alexander H. Liu, Sang-Gil Lee, Chao-Han Huck Yang, et al.
Abstract
Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are either designed for discriminative tasks or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves comparable performance to different existing foundation models, each trained on a specific task. Our findings suggest that a single general-purpose foundation model for speec
Related papers
- wav2vec: Unsupervised Pre-training for Speech Recognition (2019) 0.00
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023) 5.56
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data (2021) 0.00
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-resource Speech Recognition (2021) 8.82
- UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation (2025) 2.26
- WavThruVec: Latent Speech Representation as Intermediate Features for Neural Speech Synthesis (2022) 10.07
- UniSyn: An End-to-end Unified Model for Text-to-speech and Singing Voice Synthesis (2022) 0.00
- WavLM: Large-scale Self-supervised Pre-training for Full Stack Speech Processing (2021) 24.00