AV-RIR: Audio-visual Room Impulse Response Estimation
2023 · Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, et al.
Abstract
Accurate estimation of the Room Impulse Response (RIR), which captures an environment's acoustic properties, is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based architecture that effectively captures environment geometry and material properties and solves speech dereverberation as an auxiliary task by using multi-task learning. We also propose Geo-Mat features, which augment visual cues with material information, and CRIP, which improves late reverberation components in the estimated RIR by 86% via image-to-RIR retrieval. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches, achieving 36% to 63% improvement across various acoustic metrics in RIR estimation. Additionally, it also achieves higher pref
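As background for the abstract, reverberant speech is commonly modeled as clean speech convolved with the RIR, and reverberation time (T60) is one widely used acoustic descriptor of an RIR. The sketch below illustrates only that signal model with a synthetic RIR; it is not the AV-RIR architecture. The function names, the toy exponential-decay RIR, and the choice of T60 as the metric are assumptions made for illustration, since the abstract does not name the metrics it evaluates.

```python
import numpy as np

def synthetic_rir(t60=0.5, sr=16000, length_s=1.0, seed=0):
    """Toy RIR (assumption): white noise with an exponential decay of 60 dB at t60 seconds."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * length_s)) / sr
    decay = 10.0 ** (-3.0 * t / t60)  # amplitude factor 1e-3 (i.e. -60 dB) at t = t60
    rir = rng.standard_normal(t.size) * decay
    return rir / np.max(np.abs(rir))

def reverberate(clean, rir):
    """Common signal model: reverberant speech = clean speech convolved with the RIR."""
    return np.convolve(clean, rir)

def estimate_t60(rir, sr):
    """Estimate T60 via Schroeder backward integration.

    Fits a line to the energy decay curve between -5 dB and -25 dB
    and extrapolates the slope to a 60 dB decay.
    """
    energy = np.cumsum(rir[::-1] ** 2)[::-1]      # remaining energy after each sample
    edc_db = 10.0 * np.log10(energy / energy[0])  # energy decay curve in dB
    idx = np.where((edc_db <= -5.0) & (edc_db >= -25.0))[0]
    slope, _ = np.polyfit(idx / sr, edc_db[idx], 1)  # slope in dB per second
    return -60.0 / slope

if __name__ == "__main__":
    sr = 16000
    rir = synthetic_rir(t60=0.5, sr=sr)
    clean = np.random.default_rng(1).standard_normal(sr // 4)  # stand-in for speech
    wet = reverberate(clean, rir)
    print("reverberant length:", wet.size)
    print("estimated T60 (s):", estimate_t60(rir, sr))
```

An RIR estimator works in the opposite direction: given only the reverberant signal (and, in AV-RIR's case, visual cues), it has to recover `rir`. A metric such as T60 computed on the estimated and the reference RIR is one way to compare them.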
Related papers
- Rec-RIR: Monaural Blind Room Impulse Response Identification Via DNN-based Reverberant Speech Reconstruction In STFT Domain (2025) · 3.06
- Mmaudioreverbs: Video-guided Acoustic Modeling For Dereverberation And Room Impulse Response Estimation (2026) · 0.00
- Towards Improved Room Impulse Response Estimation For Speech Recognition (2022) · 10.61
- RIR-SF: Room Impulse Response Based Spatial Feature For Target Speech Recognition In Multi-channel Multi-speaker Scenarios (2023) · 0.00
- IR-GAN: Room Impulse Response Generator For Far-field Speech Recognition (2020) · 11.93
- Improving Reverberant Speech Separation With Multi-stage Training And Curriculum Learning (2021) · 0.00
- TS-RIR: Translated Synthetic Room Impulse Responses For Speech Augmentation (2021) · 8.35
- Audio-visual Multi-channel Speech Separation, Dereverberation And Recognition (2022) · 6.77