Joint Beamforming And Speaker-attributed ASR For Real Distant-microphone Meeting Transcription
2024 · Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, et al.
Abstract
Distant-microphone meeting transcription is a challenging task. State-of-the-art end-to-end speaker-attributed automatic speech recognition (SA-ASR) architectures lack a multichannel noise and reverberation reduction front-end, which limits their performance. In this paper, we introduce a joint beamforming and SA-ASR approach for real meeting transcription. We first describe a data alignment and augmentation method to pretrain a neural beamformer on real meeting data. We then compare fixed, hybrid, and fully neural beamformers as front-ends to the SA-ASR model. Finally, we jointly optimize the fully neural beamformer and the SA-ASR model. Experiments on the real AMI corpus show that, while state-of-the-art multi-frame cross-channel attention based channel fusion fails to improve ASR performance, fine-tuning SA-ASR on the fixed beamformer's output and jointly fine-tuning SA-ASR with the neural beamformer reduce the word error rate by 8% and 9% relative, respectively.
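To make the idea of a fixed beamformer front-end concrete, here is a minimal sketch of delay-and-sum beamforming in Python. The abstract does not say which fixed beamformer the authors use, so delay-and-sum is an assumption chosen for illustration. The function name `delay_and_sum`, the integer per-microphone delays, and the synthetic test signal are all hypothetical. In practice the delays would have to be estimated from the recordings rather than given.

```python
import numpy as np


def delay_and_sum(channels: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """Align each microphone channel by its (assumed known) delay and average.

    channels: array of shape (n_mics, n_samples)
    delays:   integer delay in samples at which each mic receives the source
    """
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, d in zip(channels, delays):
        d = int(d)
        # Advance the channel by d samples so all channels line up in time.
        aligned = np.roll(ch, -d)
        # np.roll wraps around, so zero the samples that wrapped.
        if d > 0:
            aligned[-d:] = 0.0
        elif d < 0:
            aligned[:-d] = 0.0
        out += aligned
    return out / n_mics


# Synthetic example (hypothetical data): one source reaching three mics
# with different delays and independent noise at each mic.
rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)
mic_delays = np.array([0, 3, 5])

noiseless = np.zeros((3, clean.size))
for i, d in enumerate(mic_delays):
    noiseless[i, d:] = clean[: clean.size - d]

mics = noiseless + 0.5 * rng.standard_normal(noiseless.shape)
enhanced = delay_and_sum(mics, mic_delays)
recovered = delay_and_sum(noiseless, mic_delays)
```

Averaging the time-aligned channels keeps the source signal intact while uncorrelated noise partially cancels, which is why even a fixed beamformer can help a downstream recognizer. The neural beamformers compared in the paper replace these fixed weights with learned ones, and this sketch does not attempt to reproduce them.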
Related papers
- Improving Speaker Assignment In Speaker-attributed ASR For Real Meeting Applications (2024)
- Exploring End-to-end Multi-channel ASR With Bias Information For Meeting Transcription (2020)
- A Comparative Study On Multichannel Speaker-attributed Automatic Speech Recognition In Multi-party Meetings (2022)
- A Comparative Study Of Modular And Joint Approaches For Speaker-attributed ASR On Monaural Long-form Audio (2021)
- End-to-end Multichannel Speaker-attributed ASR: Speaker Guided Decoder And Input Feature Analysis (2023)
- A Comparative Study On Speaker-attributed Automatic Speech Recognition In Multi-party Meetings (2022)
- Speaker Adapted Beamforming For Multi-channel Automatic Speech Recognition (2018)
- Spatially-augmented Sequence-to-sequence Neural Diarization For Meetings (2025)