An Empirical Study Of Visual Features For DNN Based Audio-visual Speech Enhancement In Multi-talker Environments
2020 · Shrishti Saha Shetu, Soumitro Chakrabarty, Emanuël A. P. Habets
Abstract
Audio-visual speech enhancement (AVSE) methods use both audio and visual features for the task of speech enhancement, and the use of visual features has been shown to be particularly effective in multi-speaker scenarios. In the majority of deep neural network (DNN) based AVSE methods, the audio and visual data are first processed separately using different sub-networks, and the learned features are then fused to utilize the information from both modalities. There have been various studies on suitable audio input features and network architectures; however, to the best of our knowledge, no published study has investigated which visual features are best suited for this specific task. In this work, we perform an empirical study of the most commonly used visual features for DNN-based AVSE and the pre-processing requirements for each of these features, and we investigate their influence on performance. Our study shows that despite the overall better performance of embedding-based
Related papers
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017) · 17.39
- Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021) · 0.00
- An Overview Of Deep-learning-based Audio-visual Speech Enhancement And Separation (2020) · 18.31
- LSTMSE-Net: Long Short Term Speech Enhancement Network For Audio-visual Speech Enhancement (2024) · 8.57
- A Study On Joint Modeling And Data Augmentation Of Multi-modalities For Audio-visual Scene Classification (2022) · 5.24
- How To Leverage DNN-based Speech Enhancement For Multi-channel Speaker Verification? (2022) · 0.00
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023) · 0.00
- Robust Audio-visual Target Speaker Extraction With Emotion-aware Multiple Enrollment Fusion (2025) · 0.00