One Model To Rule Them All ? Towards End-to-end Joint Speaker Diarization And Speech Recognition
2023 Β· Samuele Cornell, Jee-Weon Jung, Shinji Watanabe, et al.
Abstract
This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary length inputs and can handle any number of speakers, effectively solving ``who spoke what, when'' concurrently. SLIDAR leverages a sliding window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally, for each window: transcripts, diarization and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and ``Whisper-style" prompting. The local outputs are then combined to get the final SD+ASR result by clustering the speaker embeddings to get global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
Authors
(none)
Tags
Stats
Related papers
- Transcribe-to-diarize: Neural Speaker Diarization For Unlimited Number Of Speakers Using End-to-end Speaker-attributed ASR (2021)11.49
- Simultaneous Speech Recognition And Speaker Diarization For Monaural Dialogue Recordings With Target-speaker Acoustic Models (2019)0.00
- Speakerlm: End-to-end Versatile Speaker Diarization And Recognition With Multimodal Large Language Models (2025)5.24
- Speaker Conditioned Acoustic Modeling For Multi-speaker Conversational ASR (2021)4.52
- Exploring Speaker-related Information In Spoken Language Understanding For Better Speaker Diarization (2023)0.00
- Joint Training Of Speaker Embedding Extractor, Speech And Overlap Detection For Diarization (2024)2.26
- Integration Of Speech Separation, Diarization, And Recognition For Multi-speaker Meetings: System Description, Comparison, And Analysis (2020)13.23
- End-to-end Speaker Diarization As Post-processing (2020)11.08