TF-Locoformer: Transformer With Local Modeling By Convolution For Speech Separation And Enhancement
2024 · Kohei Saijo, Gordon Wichern, François G. Germain, et al.
Abstract
Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exce
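The block structure described in the abstract (a convolutional FFN, then self-attention, then a second convolutional FFN) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names (`conv1d`, `conv_ffn`, `self_attention`, `block`, `init_ffn`), the ReLU activation, the kernel size of 3, the identity attention projections, and the residual connections around each sub-layer are all assumptions made for illustration. The normalization the paper introduces is omitted because the abstract does not describe it. In a dual-path model, a block like this would typically be applied alternately along the frequency axis and the time axis.

```python
import math
import random

def conv1d(x, w, b):
    """Zero-padded 'same' 1-D convolution over the sequence axis (odd kernel assumed).
    x: T x C_in nested list, w: K x C_in x C_out, b: C_out."""
    T, K = len(x), len(w)
    c_in, c_out = len(w[0]), len(b)
    pad = K // 2
    out = []
    for t in range(T):
        row = list(b)
        for k in range(K):
            s = t + k - pad
            if 0 <= s < T:  # positions outside the sequence count as zeros
                for i in range(c_in):
                    for o in range(c_out):
                        row[o] += x[s][i] * w[k][i][o]
        out.append(row)
    return out

def conv_ffn(x, p):
    """FFN with convolutions in place of linear layers: conv -> ReLU -> conv."""
    h = conv1d(x, p["w1"], p["b1"])
    h = [[max(0.0, v) for v in row] for row in h]  # ReLU is an assumption
    return conv1d(h, p["w2"], p["b2"])

def self_attention(x):
    """Single-head scaled dot-product attention with identity projections
    (a simplification; a real model learns query/key/value projections)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * c for a, c in zip(q, key)) / math.sqrt(d) for key in x]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        out.append([sum(e[j] / z * x[j][c] for j in range(len(x))) for c in range(d)])
    return out

def add(x, y):
    """Element-wise sum of two T x C nested lists (residual connection)."""
    return [[a + c for a, c in zip(r, s)] for r, s in zip(x, y)]

def block(x, p_pre, p_post):
    """Conv-FFN -> self-attention -> conv-FFN, each with a residual connection."""
    x = add(x, conv_ffn(x, p_pre))    # local modeling before attention
    x = add(x, self_attention(x))     # global modeling
    x = add(x, conv_ffn(x, p_post))   # local modeling after attention
    return x

def init_ffn(rng, dim, hidden, kernel):
    """Random small weights for one convolutional FFN (hypothetical helper)."""
    def weights(k, i, o):
        return [[[rng.gauss(0, 0.1) for _ in range(o)] for _ in range(i)] for _ in range(k)]
    return {"w1": weights(kernel, dim, hidden), "b1": [0.0] * hidden,
            "w2": weights(kernel, hidden, dim), "b2": [0.0] * dim}

rng = random.Random(0)
T, D = 6, 4  # sequence length and feature dimension (toy sizes)
x = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
y = block(x, init_ffn(rng, D, 8, 3), init_ffn(rng, D, 8, 3))
```

With two kernel-size-3 convolutions, each output position of `conv_ffn` depends only on inputs within two positions of it, whereas every output of `self_attention` depends on the whole sequence. That split is the local/global division of labor the abstract describes.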
Related papers
- On Time Domain Conformer Models For Monaural Speech Separation In Noisy Reverberant Acoustic Environments (2023) · 5.84
- Dual-path Transformer Network: Direct Context-aware Modeling For End-to-end Monaural Speech Separation (2020) · 18.24
- Exploring Self-attention Mechanisms For Speech Separation (2022) · 12.54
- Multi-dimensional And Multi-scale Modeling For Speech Separation Optimized By Discriminative Learning (2023) · 0.00
- DasFormer: Deep Alternating Spectrogram Transformer For Multi/single-channel Speech Separation (2023) · 0.00
- TransMask: A Compact And Fast Speech Separation Model Based On Transformer (2021) · 8.82
- Tiny-Sepformer: A Tiny Time-domain Transformer Network For Speech Separation (2022) · 8.82
- Attention Is All You Need In Speech Separation (2020) · 20.59