Real-time Target Sound Extraction
2022 · Bandhav Veluri, Justin Chan, Malek Itani, et al.
Abstract
We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions to process large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show improvements of 2.2-3.3 dB in SI-SNRi over prior models for this task, with a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
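To make the encoder idea concrete, here is a minimal pure-Python sketch of a stack of dilated causal convolutions. It is an illustration only, not the Waveformer implementation: the kernel size, the dilation schedule, the all-ones weights, and the function names are assumptions chosen for the example. It shows two properties the abstract relies on: each output sample depends only on current and past input (so the model can stream), and the receptive field grows quickly with depth when dilations double.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: out[t] uses x[t], x[t-d], x[t-2d], ... only."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation  # look back in time, never ahead
            if idx >= 0:            # samples before t=0 act as zero padding
                acc += w * x[idx]
        out.append(acc)
    return out


def run_stack(x, weights, dilations):
    """Apply one causal layer per dilation, feeding each output to the next."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical configuration: kernel size 3, dilations doubling per layer.
DILATIONS = [1, 2, 4, 8]
WEIGHTS = [1.0, 1.0, 1.0]

if __name__ == "__main__":
    impulse = [0.0] * 40
    impulse[20] = 1.0
    response = run_stack(impulse, WEIGHTS, DILATIONS)
    print("receptive field:", receptive_field(len(WEIGHTS), DILATIONS))
    print("any output before the impulse:", any(v != 0.0 for v in response[:20]))
```

With four layers the receptive field in this sketch already spans 31 samples, while each layer still performs only three multiply-adds per output sample. A real encoder would use learned multi-channel weights and nonlinearities between layers.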
Related papers
- Real-time Streaming Wave-U-Net With Temporal Convolutions For Multichannel Speech Enhancement (2021)
- Efficient Neural Audio Synthesis (2018)
- NU-Wave: A Diffusion Probabilistic Model For Neural Audio Upsampling (2021)
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023)
- Speech Enhancement Deep-learning Architecture For Efficient Edge Processing (2024)
- Efficient Neural Networks For Real-time Modeling Of Analog Dynamic Range Compression (2021)
- SoloAudio: Target Sound Extraction With Language-oriented Audio Diffusion Transformer (2024)
- RawNet: Advanced End-to-end Deep Neural Network Using Raw Waveforms For Text-independent Speaker Verification (2019)