VoiceFilter: Targeted Voice Separation By Speaker-conditioned Spectrogram Masking
2018 · Quan Wang, Hannah Muckenhirn, Kevin Wilson, et al.
Abstract
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
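The masking step described in the abstract can be illustrated with a minimal sketch. It is not the paper's architecture: a single randomly initialized linear layer with a sigmoid stands in for the trained masking network, and the names `make_mask_weights`, `predict_mask`, and `apply_mask` are hypothetical. It only shows the data flow: the speaker embedding is appended to every spectrogram frame, the network outputs one mask value per time-frequency bin, and the mask is multiplied element-wise with the noisy spectrogram.

```python
import math
import random


def sigmoid(x):
    """Squash a real value into (0, 1) so it can act as a soft mask value."""
    return 1.0 / (1.0 + math.exp(-x))


def make_mask_weights(n_freq, emb_dim, seed=0):
    """Random weights for a toy one-layer 'masking network' (assumption:
    a stand-in for a trained model, used only to show shapes and data flow)."""
    rng = random.Random(seed)
    n_in = n_freq + emb_dim
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_freq)]


def predict_mask(noisy_spec, speaker_emb, weights):
    """Condition each frame on the speaker embedding and predict a mask.

    noisy_spec: list of frames, each a list of n_freq magnitudes.
    speaker_emb: list of emb_dim floats, the same for every frame.
    """
    mask = []
    for frame in noisy_spec:
        features = frame + speaker_emb  # concatenate frame and embedding
        mask.append(
            [sigmoid(sum(w * x for w, x in zip(row, features))) for row in weights]
        )
    return mask


def apply_mask(noisy_spec, mask):
    """Element-wise product of the mask and the noisy spectrogram."""
    return [
        [m * s for m, s in zip(mask_row, spec_row)]
        for mask_row, spec_row in zip(mask, noisy_spec)
    ]


# Toy input: 2 frames x 3 frequency bins, and a 2-dimensional speaker embedding.
noisy_spec = [[1.0, 0.5, 0.2], [0.8, 0.1, 0.4]]
speaker_emb = [0.3, -0.7]

weights = make_mask_weights(n_freq=3, emb_dim=2)
mask = predict_mask(noisy_spec, speaker_emb, weights)
enhanced = apply_mask(noisy_spec, mask)
```

Because every mask value lies strictly between 0 and 1, the enhanced magnitudes never exceed the noisy ones. In a trained system the weights would be learned so that bins dominated by the target speaker are kept and the rest are attenuated.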
Related papers
- Individualized Conditioning And Negative Distances For Speaker Separation (2022)
- Temporal-spatial Neural Filter: Direction Informed End-to-end Multi-channel Target Speech Separation (2020)
- Time-domain Speech Extraction With Spatial Information And Multi Speaker Conditioning Mechanism (2021)
- Speakerfilter-pro: An Improved Target Speaker Extractor Combines The Time Domain And Frequency Domain (2020)
- Personalized Percepnet: Real-time, Low-complexity Target Voice Separation And Enhancement (2021)
- TS-SEP: Joint Diarization And Separation Conditioned On Estimated Speaker Embeddings (2023)
- Monaural Singing Voice Separation With Skip-filtering Connections And Recurrent Inference Of Time-frequency Mask (2017)
- Single-channel Speech Separation With Auxiliary Speaker Embeddings (2019)