MossFormer: Pushing The Performance Limit Of Monaural Speech Separation Using Gated Single-head Transformer With Convolution-augmented Joint Self-attentions
2023 · Shengkui Zhao, Bin Ma
Abstract
Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named MossFormer (Monaural speech separation TransFormer). To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence. The joint attention enables MossFormer to model full-sequence elemental
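The joint local and global attention described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the softmax scoring inside chunks, the `elu(x) + 1` feature map for the linearised branch, the plain summation of the two branches, and all function names and the `chunk_size` parameter are assumptions made for illustration. Gating, convolution augmentation, and learned projections are omitted.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention(q, k, v, chunk_size):
    """Full (quadratic) attention computed independently inside each chunk."""
    out = []
    scale = 1.0 / math.sqrt(len(q[0]))
    for start in range(0, len(q), chunk_size):
        qc, kc, vc = (m[start:start + chunk_size] for m in (q, k, v))
        scores = matmul(qc, transpose(kc))
        weights = [softmax([s * scale for s in row]) for row in scores]
        out.extend(matmul(weights, vc))
    return out

def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice, not from the source)."""
    return [[xi + 1.0 if xi > 0 else math.exp(xi) for xi in row] for row in x]

def global_linear_attention(q, k, v):
    """Linearised attention over the full sequence: phi(Q) (phi(K)^T V)."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)            # (d, d_v) summary of the whole sequence
    k_sum = [sum(col) for col in zip(*fk)]   # (d,) normaliser terms
    out = []
    for row_q, row_num in zip(fq, matmul(fq, kv)):
        denom = sum(a * b for a, b in zip(row_q, k_sum))
        out.append([x / denom for x in row_num])
    return out

def joint_attention(q, k, v, chunk_size):
    """Sum of the local (per-chunk) and global (linearised) branches (assumed combination)."""
    local = local_attention(q, k, v, chunk_size)
    global_out = global_linear_attention(q, k, v)
    return [[a + b for a, b in zip(rl, rg)] for rl, rg in zip(local, global_out)]

# Toy input: 6 time steps, feature dimension 2, two chunks of 3 steps each.
q = [[0.1 * i, -0.2 * i] for i in range(6)]
k = [[0.3 * i, 0.1] for i in range(6)]
v = [[float(i), 1.0] for i in range(6)]
out = joint_attention(q, k, v, chunk_size=3)
```

In this sketch the local branch costs time quadratic in the chunk length, while the global branch summarises the whole sequence in a single `d × d_v` matrix, so every position can draw on every other position without a second, cross-chunk attention pass.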
Related papers
- Exploring Self-attention Mechanisms For Speech Separation (2022) · 12.54
- Attention Is All You Need In Speech Separation (2020) · 20.59
- U-Former: Improving Monaural Speech Enhancement With Multi-head Self And Cross Attention (2022) · 0.00
- Dual-path Transformer Network: Direct Context-aware Modeling For End-to-end Monaural Speech Separation (2020) · 18.24
- Monaural Multi-speaker Speech Separation Using Efficient Transformer Model (2023) · 0.00
- DasFormer: Deep Alternating Spectrogram Transformer For Multi/single-channel Speech Separation (2023) · 0.00
- TransMask: A Compact And Fast Speech Separation Model Based On Transformer (2021) · 8.82
- Resource-efficient Separation Transformer (2022) · 7.81