Spectron: Target Speaker Extraction Using Conditional Transformer With Adversarial Refinement
2024 · Tathagata Bandyopadhyay
Abstract
Recently, attention-based transformers have become a de facto standard in many deep learning applications, including natural language processing, computer vision, and signal processing. In this paper, we propose a transformer-based end-to-end model to extract a target speaker's speech from a monaural multi-speaker mixed audio signal. Unlike existing speaker extraction methods, we introduce two additional objectives that impose speaker embedding consistency and waveform encoder invertibility, and we jointly train the speaker encoder and the speech separator to better capture the speaker-conditional embedding. Furthermore, we leverage a multi-scale discriminator to refine the perceptual quality of the extracted speech. Our experiments show that using a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves over the CNN baseline by \(3.12\) dB. Finally, we compare our approach with recent state-of-the-art methods and show that our model outperforms existing
Related papers
- Speaker-conditioning Single-channel Target Speaker Extraction Using Conformer-based Architectures (2022)
- Target Confusion In End-to-end Speaker Extraction: Analysis And Approaches (2022)
- Voicefilter: Targeted Voice Separation By Speaker-conditioned Spectrogram Masking (2018)
- Dual-path Transformer Based Neural Beamformer For Target Speech Extraction (2023)
- Target Speaker Extraction By Directly Exploiting Contextual Information In The Time-frequency Domain (2024)
- Dual-path Transformer Network: Direct Context-aware Modeling For End-to-end Monaural Speech Separation (2020)
- Speakerfilter-pro: An Improved Target Speaker Extractor Combines The Time Domain And Frequency Domain (2020)
- Speaker-conditioned Target Speaker Extraction Based On Customized LSTM Cells (2021)