MMAudioReverbs: Video-guided Acoustic Modeling For Dereverberation And Room Impulse Response Estimation
2026 · Akira Takahashi, Ryosuke Sawata, Shusuke Takahashi, et al.
Abstract
arXiv:2605.00431v1. Although recent video-to-audio (V2A) models excel at synthesizing semantically plausible sounds from visual inputs, they do not explicitly model room-acoustic effects such as reverberation or room impulse responses (RIRs), and thus offer limited control over these effects. We hypothesize, however, that such V2A models implicitly encode semantic knowledge of the relationship between spatial audio and the corresponding visual cues. In this paper, we revisit a V2A model with this hypothesis in mind and propose a way to use the pretrained model as a prior for physically grounded room-acoustic processing. Building on MMAudio, a state-of-the-art V2A model, we propose MMAudioReverbs, a unified framework that handles i) dereverberation and ii) RIR estimation without modifying the network architecture, fine-tuned on a small dataset. Experimental results showed that audio and visual cues respecti
Related papers
- AV-RIR: Audio-visual Room Impulse Response Estimation (2023) 0.00
- Audio-visual Multi-channel Speech Separation, Dereverberation And Recognition (2022) 6.77
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022) 15.58
- Synthetic Wave-geometric Impulse Responses For Improved Speech Dereverberation (2022) 0.00
- Audio-visual Speech Separation And Dereverberation With A Two-stage Multimodal Network (2019) 12.47
- BUDDy: Single-channel Blind Unsupervised Dereverberation With Diffusion Models (2024) 8.35
- Towards Improved Room Impulse Response Estimation For Speech Recognition (2022) 10.61
- Towards Improving Speaker Distance Estimation Through Generative Impulse Response Augmentation (2026) 0.00