Adapting Self-supervised Models To Multi-talker Speech Recognition Using Speaker Embeddings
2022 · Zili Huang, Desh Raj, Paola García, et al.
Abstract
Self-supervised learning (SSL) methods, which learn representations of data without explicit supervision, have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often perform worse in multi-talker scenarios, possibly due to domain mismatch, which severely limits their use for such applications. In this paper, we investigate the adaptation of upstream SSL models to the multi-talker automatic speech recognition (ASR) task under two conditions. First, when segmented utterances are given, we show that adding a target speaker extraction (TSE) module based on enrollment embeddings is complementary to mixture-aware pre-training. Second, for unsegmented mixtures, we propose a novel joint speaker modeling (JSM) approach, which aggregates information from all speakers in the mixture through their embeddings. With controlled experiments on Libri2Mix, we show that using speaker embeddings provides relative WER improv
Related papers
- Target Speech Extraction With Pre-trained Self-supervised Learning Models (2024)
- Self-supervised Learning With Bi-label Masked Speech Prediction For Streaming Multi-talker Speech Recognition (2022)
- UniSpeech-SAT: Universal Speech Representation Learning With Speaker Aware Pre-training (2021)
- Efficient Infusion Of Self-supervised Representations In Automatic Speech Recognition (2024)
- The Efficacy Of Self-supervised Speech Models For Audio Representations (2022)
- Weakly-supervised Speech Pre-training: A Case Study On Target Speech Recognition (2023)
- An Adapter Based Pre-training For Efficient And Scalable Self-supervised Speech Representation Learning (2021)
- Exploring Effective Fusion Algorithms For Speech Based Self-supervised Learning Models (2022)