USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction
2024 · Bang Zeng, Ming Li
Abstract
Target speaker extraction aims to separate the voice of a specific speaker from mixed speech. Traditionally, this process has relied on extracting a speaker embedding from a reference speech, in which a speaker recognition model is required. However, identifying an appropriate speaker recognition model can be challenging, and using the target speaker embedding as reference information may not be optimal for target speaker extraction tasks. This paper introduces a Universal Speaker Embedding-Free Target Speaker Extraction (USEF-TSE) framework that operates without relying on speaker embeddings. USEF-TSE utilizes a multi-head cross-attention mechanism as a frame-level target speaker feature extractor. This innovative approach allows mainstream speaker extraction solutions to bypass the dependency on speaker recognition models and better leverage the information available in the enrollment speech, including speaker characteristics and contextual details. Additionally, USEF-TSE can seamles
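The frame-level extractor mentioned in the abstract can be illustrated with a minimal NumPy sketch of multi-head cross-attention. This is not the authors' implementation. It assumes that the encoded mixture supplies the queries and the encoded enrollment speech supplies the keys and values, so that the output has one target-speaker feature vector per mixture frame; the abstract does not state this assignment. The function name, the random projection matrices, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(mixture, enrollment, w_q, w_k, w_v, w_o, num_heads):
    """Attend from mixture frames (queries) to enrollment frames (keys/values).

    mixture:    (T_mix, d_model) encoded mixed speech  (assumed query source)
    enrollment: (T_enr, d_model) encoded reference speech (assumed key/value source)
    Returns frame-level features of shape (T_mix, d_model) and the
    attention weights of shape (num_heads, T_mix, T_enr).
    """
    t_mix, d_model = mixture.shape
    t_enr = enrollment.shape[0]
    d_head = d_model // num_heads

    # Project, then split the model dimension into heads: (H, T, d_head).
    q = (mixture @ w_q).reshape(t_mix, num_heads, d_head).transpose(1, 0, 2)
    k = (enrollment @ w_k).reshape(t_enr, num_heads, d_head).transpose(1, 0, 2)
    v = (enrollment @ w_v).reshape(t_enr, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    context = weights @ v

    # Merge heads back to (T_mix, d_model) and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(t_mix, d_model)
    return merged @ w_o, weights


# Toy inputs: 50 mixture frames and 30 enrollment frames (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, num_heads = 16, 4
mixture = rng.standard_normal((50, d_model))
enrollment = rng.standard_normal((30, d_model))
w_q, w_k, w_v, w_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

features, weights = multi_head_cross_attention(
    mixture, enrollment, w_q, w_k, w_v, w_o, num_heads
)
print(features.shape)  # (50, 16)
```

Under this assumption the output length follows the mixture rather than the enrollment, so no pooling of the reference into a single speaker embedding is needed, which matches the embedding-free idea described above.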
Related papers
- USEV: Universal Speaker Extraction With Visual Cue (2021) 12.17
- Target Speaker Extraction By Directly Exploiting Contextual Information In The Time-frequency Domain (2024) 9.59
- Focus On The Sound Around You: Monaural Target Speaker Extraction Via Distance And Speaker Information (2023) 7.81
- X-crossnet: A Complex Spectral Mapping Approach To Target Speaker Extraction With Cross Attention Speaker Embedding Fusion (2024) 0.00
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023) 5.24
- New Insights On Target Speaker Extraction (2022) 0.00
- Quantitative Evidence On Overlooked Aspects Of Enrollment Speaker Embeddings For Target Speaker Separation (2022) 7.16
- Multi-stage Speaker Extraction With Utterance And Frame-level Reference Signals (2020) 12.54