Abstract

Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head transformer architecture with convolution-augmented joint self-attentions, named MossFormer (Monaural speech separation TransFormer). To effectively solve the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence. The joint attention enables MossFormer model full-sequence elemental
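The joint local and global attention described in the abstract can be illustrated with a minimal sketch. This is not the MossFormer implementation: the softmax inside each chunk, the elu(x)+1 feature map for the linearised path, and the plain summation of the two branches are all assumptions chosen for readability, and the gating and convolution modules are omitted. The function names (`local_chunk_attention`, `global_linear_attention`, `joint_attention`) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def feature_map(x):
    # elu(x) + 1: a strictly positive feature map (an assumed choice).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def local_chunk_attention(q, k, v, chunk_size):
    """Full (quadratic) softmax attention, computed separately per chunk."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    out = []
    for start in range(0, n, chunk_size):
        idx = range(start, min(start + chunk_size, n))
        for i in idx:
            w = softmax([dot(q[i], k[j]) / math.sqrt(d) for j in idx])
            out.append([sum(wj * v[j][c] for wj, j in zip(w, idx))
                        for c in range(dv)])
    return out

def global_linear_attention(q, k, v):
    """Linearised attention over the full sequence.

    phi(K)^T V is accumulated once, so the cost grows linearly with the
    sequence length instead of quadratically.
    """
    n, d, dv = len(k), len(q[0]), len(v[0])
    phi_k = [feature_map(row) for row in k]
    kv = [[sum(phi_k[j][a] * v[j][c] for j in range(n)) for c in range(dv)]
          for a in range(d)]
    z = [sum(phi_k[j][a] for j in range(n)) for a in range(d)]
    out = []
    for row in q:
        pq = feature_map(row)
        norm = dot(pq, z)
        out.append([sum(pq[a] * kv[a][c] for a in range(d)) / norm
                    for c in range(dv)])
    return out

def joint_attention(q, k, v, chunk_size):
    # Assumption: the two branches are combined by simple addition.
    local = local_chunk_attention(q, k, v, chunk_size)
    glob = global_linear_attention(q, k, v)
    return [[a + b for a, b in zip(lr, gr)] for lr, gr in zip(local, glob)]
```

Here `q`, `k`, and `v` are lists of per-frame vectors. In this sketch the local branch never looks outside its own chunk, so any dependence on a frame in a different chunk reaches the output only through the global branch.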

Authors

(none)

Tags

  • Speech Translation
  • Text-to-Speech
  • Speech Recognition

Stats

  • citations: 63
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 13.55
  • arxiv key: zhao2023mossformer

Related papers