SpeakerLM: End-to-end Versatile Speaker Diarization And Recognition With Multimodal Large Language Models
2025 · Han Yin, Yafeng Chen, Chong Deng, et al.
Abstract
The Speaker Diarization and Recognition (SDR) task aims to predict "who spoke when and what" within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a mult
Related papers
- DiarizationLM: Speaker Diarization Post-processing With Large Language Models (2024)
- SEAL: Speaker Error Correction Using Acoustic-conditioned Large Language Models (2025)
- One Model To Rule Them All? Towards End-to-end Joint Speaker Diarization And Speech Recognition (2023)
- LLM-based Speaker Diarization Correction: A Generalizable Approach (2024)
- Lexical Speaker Error Correction: Leveraging Language Models For Speaker Diarization Error Correction (2023)
- Enhancing Speaker Diarization With Large Language Models: A Contextual Beam Search Approach (2023)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024)