LSTMSE-Net: Long Short Term Speech Enhancement Network For Audio-visual Speech Enhancement
2024 · Arnav Jain, Jasmer Singh Sanjotra, Harshvardhan Choudhary, et al.
Abstract
In this paper, we propose the long short-term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. The method leverages the complementary nature of visual and audio information to improve the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates the visual and audio features, then passes them through a separator network for speech enhancement. The architecture highlights advances in leveraging multi-modal data and interpolation techniques for robust AVSE challenge systems. LSTMSE-Net surpasses the baseline model from the COG-MHEAR AVSE Challenge 2024 by 0.06 in scale-invariant signal-to-distortion ratio (SISDR), 0.03 in short-time objective intelligibility (STOI), and 1.32 in perceptual evaluation of speech quality (PESQ). The source code of the prop
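For readers unfamiliar with the first metric in the abstract, the following is a minimal sketch of the standard scale-invariant signal-to-distortion ratio definition, written in plain Python. It is not taken from the LSTMSE-Net source code: the function name `si_sdr`, the `eps` stabiliser, and the toy signals are assumptions made for illustration, and the challenge's official evaluation scripts may differ in detail.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Illustrative helper (hypothetical name); follows the usual definition:
    project the estimate onto the reference, then compare the energy of the
    projection with the energy of the residual.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    target = [alpha * r for r in ref]                   # scaled reference
    residual = [e - t for e, t in zip(est, target)]     # everything else

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy signals (assumed for illustration only): a sine "speech" reference
# plus two levels of additive interference.
reference = [math.sin(0.1 * i) for i in range(200)]
mild = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]
heavy = [r + 0.5 * math.cos(0.37 * i) for i, r in enumerate(reference)]

print(round(si_sdr(mild, reference), 2), round(si_sdr(heavy, reference), 2))
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the enhanced signal leaves the score essentially unchanged, which is why the metric is called scale-invariant.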
Related papers
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017) · 17.39
- Improved Lite Audio-visual Speech Enhancement (2020) · 11.39
- An Empirical Study Of Visual Features For DNN Based Audio-visual Speech Enhancement In Multi-talker Environments (2020) · 3.58
- VSANet: Real-time Speech Enhancement Based On Voice Activity Detection And Causal Spatial Attention (2023) · 5.24
- Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021) · 0.00
- LA-VocE: Low-SNR Audio-visual Speech Enhancement Using Neural Vocoders (2022) · 0.00
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) · 0.00
- VSEGAN: Visual Speech Enhancement Generative Adversarial Network (2021) · 8.60