Exploring The Integration Of Speech Separation And Recognition With Self-supervised Learning Representation
2023 · Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, et al.
Abstract
Neural speech separation has made remarkable progress, and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work investigates speech separation as an ASR front-end in reverberant and noisy-reverberant scenarios. In detail, we explore multi-channel separation methods, namely mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve recognition performance over filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set, significantly outperforming
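For readers unfamiliar with the metric quoted in the abstract: word error rate (WER) is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a recognized transcript and its reference, divided by the number of reference words. The following is a minimal sketch of that definition only, not the scoring pipeline used in the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. Multi-speaker evaluation also has to decide which output stream is scored against which speaker's reference, which this sketch does not handle.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, which is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors while the denominator is fixed by the reference, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.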
Related papers
- End-to-end Integration Of Speech Recognition, Dereverberation, Beamforming, And Self-supervised Learning Representation (2022)
- Investigating Self-supervised Learning For Speech Enhancement And Separation (2022)
- TF-GridNet: Integrating Full- And Sub-band Modeling For Speech Separation (2022)
- Investigation Of Practical Aspects Of Single Channel Speech Separation For ASR (2021)
- Elevating Robust Multi-talker ASR By Decoupling Speaker Separation And Speech Recognition (2025)
- SSHR: Leveraging Self-supervised Hierarchical Representations For Multilingual Automatic Speech Recognition (2023)
- WHAMR!: Noisy And Reverberant Single-channel Speech Separation (2019)
- Exploring Self-attention Mechanisms For Speech Separation (2022)