Multi-stage Speaker Extraction With Utterance And Frame-level Reference Signals
2020 · Meng Ge, Chenglin Xu, Longbiao Wang, et al.
Abstract
Speaker extraction requires a speech sample from the target speaker as the reference. However, enrolling a speaker with a long speech sample is not practical. We propose a speaker extraction technique that operates in multiple stages to take full advantage of a short reference speech sample. The speech extracted in early stages is used as the reference speech for later stages. For the first time, we use a frame-level sequential speech embedding as the reference for the target speaker. This is a departure from the traditional utterance-based speaker embedding reference. In addition, a signal fusion scheme is proposed to combine the decoded signals at multiple scales with automatically learned weights. Experiments on WSJ0-2mix and its noisy versions (WHAM! and WHAMR!) show that SpEx++ consistently outperforms other state-of-the-art baselines.
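The two mechanisms named in the abstract, feeding each stage's output back in as the next stage's reference and fusing multi-scale decoded signals with weights, can be illustrated with a toy sketch. This is not the SpEx++ implementation: `extract_fn`, `multi_stage_extract`, and `fuse_signals` are hypothetical names, the softmax normalisation of the weights is an assumption, and the logits here are fixed numbers standing in for parameters that would be learned in training.

```python
import math


def softmax(logits):
    """Normalise raw scores into positive weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_signals(decoded, logits):
    """Weighted sum of equal-length decoded signals, one signal per scale.

    Assumption: the weights come from a softmax over one logit per scale.
    """
    weights = softmax(logits)
    length = len(decoded[0])
    return [
        sum(w * signal[t] for w, signal in zip(weights, decoded))
        for t in range(length)
    ]


def multi_stage_extract(mixture, reference, extract_fn, num_stages):
    """Run extract_fn once per stage; each estimate becomes the next reference."""
    outputs = []
    for _ in range(num_stages):
        estimate = extract_fn(mixture, reference)
        outputs.append(estimate)
        reference = estimate  # later stages see the earlier extracted speech
    return outputs


# Toy stand-in for an extraction network (hypothetical, for illustration only).
def toy_extract(mixture, reference):
    return [m + r for m, r in zip(mixture, reference)]


stages = multi_stage_extract([1.0, 1.0], [0.0, 0.0], toy_extract, num_stages=3)
fused = fuse_signals([[1.0, 1.0], [3.0, 3.0]], logits=[0.0, 0.0])
```

With equal logits the fusion reduces to a plain average of the per-scale signals; unequal logits would favour one scale over the others.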
Related papers
- MC-SpEx: Towards Effective Speaker Extraction With Multi-scale Interfusion And Conditional Speaker Modulation (2023)
- A Two-stage Speaker Extraction Algorithm Under Adverse Acoustic Conditions Using A Single-microphone (2023)
- Single Microphone Speaker Extraction Using Unified Time-frequency Siamese-Unet (2022)
- Time-domain Speech Extraction With Spatial Information And Multi Speaker Conditioning Mechanism (2021)
- Audio-visual Active Speaker Extraction For Sparsely Overlapped Multi-talker Speech (2023)
- USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction (2024)
- Target Speaker Extraction By Directly Exploiting Contextual Information In The Time-frequency Domain (2024)
- Robust Speaker Extraction Network Based On Iterative Refined Adaptation (2020)