Contextual Audio-visual Switching For Speech Enhancement In Real-world Environments
2018 · Ahsan Adeel, Mandar Gogate, Amir Hussain
Abstract
Human speech processing is inherently multimodal: visual cues (lip movements) help listeners understand speech in noise. Lip-reading driven speech enhancement significantly outperforms benchmark audio-only approaches at low signal-to-noise ratios (SNRs). However, at high SNRs, that is, at low levels of background noise, visual cues become less effective for speech enhancement. A context-aware audio-visual (AV) system is therefore required, one that uses both visual and noisy audio features and accounts for different noise conditions. In this paper, we introduce a novel contextual AV switching component that exploits AV cues according to the operating conditions to estimate clean audio, without requiring any SNR estimation. The switching module switches between visual-only (V-only), audio-only (A-only), and combined AV cues at low, high and moderate SNR levels, respectively. The contextual AV switching component is develop
Related papers
- Switching Variational Auto-encoders For Noise-agnostic Audio-visual Speech Enhancement (2021)
- Visual Context-driven Audio Feature Enhancement For Robust End-to-end Audio-visual Speech Recognition (2022)
- Audio-visual Speech Codecs: Rethinking Audio-visual Speech Enhancement By Re-synthesis (2022)
- AV2AV: Direct Audio-visual Speech To Audio-visual Speech Translation With Unified Audio-visual Speech Representation (2023)
- Audio-visual Speech Enhancement Using Conditional Variational Auto-encoders (2019)
- AV2Wav: Diffusion-based Re-synthesis From Continuous Self-supervised Features For Audio-visual Speech Enhancement (2023)
- Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions (2019)
- Improved Lite Audio-visual Speech Enhancement (2020)