Target Speech Extraction With Pre-trained Self-supervised Learning Models
2024 · Junyi Peng, Marc Delcroix, Tsubasa Ochiai, et al.
Abstract
Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state-of-the-art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolut
Related papers
- Adapting Self-supervised Models To Multi-talker Speech Recognition Using Speaker Embeddings (2022)
- Downstream Task Agnostic Speech Enhancement With Self-supervised Representation Loss (2023)
- UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-training (2021)
- Non-contrastive Self-supervised Learning For Utterance-level Information Extraction From Speech (2022)
- Investigating Self-supervised Learning For Speech Enhancement And Separation (2022)
- Exploring Effective Fusion Algorithms For Speech Based Self-supervised Learning Models (2022)
- The Efficacy Of Self-supervised Speech Models For Audio Representations (2022)
- MT4SSL: Boosting Self-supervised Speech Representation Learning By Integrating Multiple Targets (2022)