MSA-ASR: Efficient Multilingual Speaker Attribution With Frozen ASR Models
2024 Β· Thai-Binh Nguyen, Alexander Waibel
Abstract
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. Existing methods often rely on complex modular systems or require extensive fine-tuning of joint modules, limiting their adaptability and general efficiency. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. Our method involves training a speaker module to predict speaker embeddings based on weak labels without requiring additional ASR model modifications. Despite being trained exclusively with non-overlapping monolingual data, our approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experimental results demonstrate competitive performance compared to strong baselines, highlighting the model's robustness and potential for pra
Authors
(none)
Tags
Stats
Related papers
- A Comparative Study Of Modular And Joint Approaches For Speaker-attributed ASR On Monaural Long-form Audio (2021)7.50
- End-to-end Multichannel Speaker-attributed ASR: Speaker Guided Decoder And Input Feature Analysis (2023)0.00
- A Comparative Study On Speaker-attributed Automatic Speech Recognition In Multi-party Meetings (2022)8.09
- A Comparative Study On Multichannel Speaker-attributed Automatic Speech Recognition In Multi-party Meetings (2022)5.24
- Investigation Of End-to-end Speaker-attributed ASR For Continuous Multi-talker Recordings (2020)10.35
- Improving Speaker Assignment In Speaker-attributed ASR For Real Meeting Applications (2024)0.00
- SSHR: Leveraging Self-supervised Hierarchical Representations For Multilingual Automatic Speech Recognition (2023)0.00
- Transcribe-to-diarize: Neural Speaker Diarization For Unlimited Number Of Speakers Using End-to-end Speaker-attributed ASR (2021)11.49