Progressive Residual Extraction Based Pre-training For Speech Representation Learning
2024 · Tianrui Wang, Jin Li, Ziyang Ma, et al.
Abstract
Self-supervised learning (SSL) has garnered significant attention in speech processing, excelling in linguistic tasks such as speech recognition. However, jointly improving the performance of pre-trained models on various downstream tasks, each requiring different speech information, poses significant challenges. To this end, we propose a progressive residual extraction based self-supervised learning method, named ProgRE. Specifically, we introduce two lightweight and specialized task modules into an encoder-style SSL backbone to enhance its ability to extract pitch variation and speaker information from speech. Furthermore, to keep the reinforced pitch variation and speaker information from interfering with the learning of unrelated content information, we residually remove the information extracted by these two modules from the main branch. The main branch is then trained using HuBERT's speech masking prediction to ensure the performance of the Transformer's deep-layer features on co
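The residual removal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the random linear maps standing in for the two task modules, the frame-wise treatment of speaker information, the pitch-then-speaker ordering, and every name (`make_linear`, `progressive_residual_extraction`, `pitch_w`, `speaker_w`) are assumptions made for illustration only.

```python
import random


def make_linear(dim, rng):
    """Hypothetical stand-in for a lightweight task module: a random dim x dim linear map."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_linear(weights, frame):
    """Multiply one feature frame by the module's weight matrix."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def subtract(a, b):
    """Element-wise a - b, i.e. residual removal of b from a."""
    return [x - y for x, y in zip(a, b)]


def progressive_residual_extraction(frames, pitch_w, speaker_w):
    """Extract pitch, then speaker, features and subtract each from the main branch.

    Assumed ordering: pitch first, then speaker on the remaining residual.
    Returns (pitch_feats, speaker_feats, content_feats), one entry per frame.
    """
    pitch_feats, speaker_feats, content_feats = [], [], []
    for frame in frames:
        pitch = apply_linear(pitch_w, frame)       # pitch-variation module output
        residual = subtract(frame, pitch)          # remove it from the main branch
        speaker = apply_linear(speaker_w, residual)  # speaker module sees the residual
        residual = subtract(residual, speaker)     # remove speaker information too
        pitch_feats.append(pitch)
        speaker_feats.append(speaker)
        content_feats.append(residual)             # what the main branch would keep
    return pitch_feats, speaker_feats, content_feats


rng = random.Random(0)
dim, num_frames = 4, 5
frames = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]
pitch_w = make_linear(dim, rng)
speaker_w = make_linear(dim, rng)
pitch_feats, speaker_feats, content_feats = progressive_residual_extraction(
    frames, pitch_w, speaker_w
)
```

By construction, each input frame equals the sum of its pitch, speaker, and content parts, so nothing is discarded; the information is only routed to different branches. In this sketch the content residual is what a masked-prediction objective would then be applied to.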
Related papers
- UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-training (2021) · 0.00
- Multi-resolution HuBERT: Multi-resolution Speech Self-supervised Learning With Masked Unit Prediction (2023) · 0.00
- An Adapter Based Multi-label Pre-training For Speech Separation And Enhancement (2022) · 7.50
- Target Speech Extraction With Pre-trained Self-supervised Learning Models (2024) · 9.41
- Self-supervised Learning For Speech Recognition With Intermediate Layer Supervision (2021) · 9.41
- Non-contrastive Self-supervised Learning For Utterance-level Information Extraction From Speech (2022) · 9.59
- Weakly-supervised Speech Pre-training: A Case Study On Target Speech Recognition (2023) · 8.09
- Automatic Pronunciation Assessment Using Self-supervised Speech Representation Learning (2022) · 0.00