Survey Of End-to-end Multi-speaker Automatic Speech Recognition For Monaural Audio
2025 · Xinlu He, Jacob Whitehill
Abstract
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-
Related papers
- End-to-end Monaural Multi-speaker ASR System Without Pretraining (2018)
- A Comparative Study Of Modular And Joint Approaches For Speaker-attributed ASR On Monaural Long-form Audio (2021)
- Investigation Of End-to-end Speaker-attributed ASR For Continuous Multi-talker Recordings (2020)
- An Investigation Of End-to-end Multichannel Speech Recognition For Reverberant And Mismatch Conditions (2019)
- MIMO-SPEECH: End-to-end Multi-channel Multi-speaker Speech Recognition (2019)
- Directional ASR: A New Paradigm For E2E Multi-speaker Speech Recognition With Source Localization (2020)
- Streaming Multi-speaker ASR With RNN-T (2020)
- A Comparative Study On Multichannel Speaker-attributed Automatic Speech Recognition In Multi-party Meetings (2022)