RIR-SF: Room Impulse Response Based Spatial Feature For Target Speech Recognition In Multi-channel Multi-speaker Scenarios
2023 · Yiwen Shao, Shi-Xiong Zhang, Dong Yu
Abstract
Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods that use 3D spatial data from multi-channel audio and visual cues focus mainly on the direct waves from the target speaker and overlook the impact of reflected waves, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on the room impulse response (RIR) that leverages the speaker's position, room acoustics, and reflection dynamics. RIR-SF significantly outperforms traditional 3D spatial features, showing superior performance both in theory and in experiments. We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3% reduction in character error rate (CER) for target-speaker ASR in multi-channel settings. RIR-SF improves recognition accuracy and remains robust in high-reverberation scenarios, overcoming the limitations of previous methods.
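As a rough illustration of the distinction the abstract draws between direct and reflected waves, the sketch below builds a toy room impulse response and convolves a dry signal with it. It is not the RIR-SF feature or the authors' framework; the function names `toy_rir` and `reverberate`, the delays, and the gains are all hypothetical values chosen for illustration.

```python
import numpy as np


def toy_rir(direct_delay, reflections, length):
    """Build a toy RIR: a unit direct-path impulse plus attenuated reflections.

    `reflections` is a list of (delay_in_samples, gain) pairs. All values are
    illustrative assumptions, not measurements or numbers from the paper.
    """
    h = np.zeros(length)
    h[direct_delay] = 1.0          # direct wave from the speaker
    for delay, gain in reflections:
        h[delay] += gain           # later, weaker copies from wall reflections
    return h


def reverberate(dry, rir):
    """Model a reverberant microphone signal as the dry signal convolved with the RIR."""
    return np.convolve(dry, rir)


# A single click as the dry source makes the RIR structure visible in the output.
dry = np.zeros(8)
dry[0] = 1.0

rir = toy_rir(direct_delay=2, reflections=[(5, 0.5), (9, 0.25)], length=12)
wet = reverberate(dry, rir)
```

In this toy setup, a feature that models only the direct path would account for the impulse at sample 2 and ignore the energy arriving at samples 5 and 9, which is the kind of reflected-wave information the abstract says earlier 3D spatial features overlook.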
Related papers
- Towards Improved Room Impulse Response Estimation For Speech Recognition (2022)
- TS-RIR: Translated Synthetic Room Impulse Responses For Speech Augmentation (2021)
- AV-RIR: Audio-visual Room Impulse Response Estimation (2023)
- Rec-RIR: Monaural Blind Room Impulse Response Identification Via DNN-based Reverberant Speech Reconstruction In STFT Domain (2025)
- 3-D Feature And Acoustic Modeling For Far-field Speech Recognition (2019)
- Automatic Channel Selection And Spatial Feature Integration For Multi-channel Speech Recognition Across Various Array Topologies (2023)
- Frequency Domain Multi-channel Acoustic Modeling For Distant Speech Recognition (2019)
- Streaming Multi-speaker ASR With RNN-T (2020)