Audio-visual Speech Separation In Noisy Environments With A Lightweight Iterative Model
2023 · Héctor Martel, Julius Richter, Kai Li, et al.
Abstract
We propose Audio-Visual Lightweight ITerative model (AVLIT), an effective and lightweight neural network that uses Progressive Learning (PL) to perform audio-visual speech separation in noisy environments. To this end, we adopt the Asynchronous Fully Recurrent Convolutional Neural Network (A-FRCNN), which has shown successful results in audio-only speech separation. Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality. We evaluated our model in a controlled environment using the NTCD-TIMIT dataset and in the wild using a synthetic dataset that combines LRS3 and WHAM!. The experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines. Furthermore, the reduced footprint of our model makes it suitable for low-resource applications.
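The weight sharing mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `SharedBlock` is a hypothetical stand-in (one affine map plus ReLU) for an A-FRCNN block, and the re-injection of the input at each iteration is an assumption made for illustration. The point it shows is that unrolling one block for more iterations adds computation but no parameters, which is why an iterative design can keep a small footprint.

```python
import random


class SharedBlock:
    """Hypothetical stand-in for one A-FRCNN block: affine map followed by ReLU."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                       for _ in range(dim)]
        self.bias = [0.0] * dim

    def num_params(self):
        # Parameters are counted once, however many times the block is applied.
        return sum(len(row) for row in self.weight) + len(self.bias)

    def __call__(self, x):
        out = []
        for row, b in zip(self.weight, self.bias):
            s = sum(w * v for w, v in zip(row, x)) + b
            out.append(max(0.0, s))  # ReLU
        return out


def iterate_shared(block, x, num_iters):
    """Apply the same block num_iters times.

    Assumption: the original input is added back before each repeated pass.
    """
    state = block(x)
    for _ in range(num_iters - 1):
        state = block([s + v for s, v in zip(state, x)])
    return state


# One shared block per modality (audio and video), as the abstract describes.
audio_block = SharedBlock(dim=4, seed=1)
video_block = SharedBlock(dim=4, seed=2)

audio_feat = [0.5, -0.2, 0.1, 0.3]   # toy feature vectors, not real data
video_feat = [0.1, 0.4, -0.3, 0.2]

audio_out = iterate_shared(audio_block, audio_feat, num_iters=8)
video_out = iterate_shared(video_block, video_feat, num_iters=4)
```

Here `audio_block.num_params()` returns the same value whether the block is unrolled for 1 or 8 iterations. The iteration counts and feature sizes above are arbitrary illustrative choices, not values taken from the paper.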
Related papers
- Improved Lite Audio-visual Speech Enhancement (2020) · 11.39
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024) · 8.60
- Audio Visual Speech Recognition Using Deep Recurrent Neural Networks (2016) · 7.81
- AVFormer: Injecting Vision Into Frozen Speech Models For Zero-shot AV-ASR (2023) · 7.81
- Speech Separation Using An Asynchronous Fully Recurrent Convolutional Neural Network (2021) · 0.00
- AVLnet: Learning Audio-visual Language Representations From Instructional Videos (2020) · 12.87
- RTFS-Net: Recurrent Time-frequency Modelling For Efficient Audio-visual Speech Separation (2023) · 0.00
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) · 0.00