X-CrossNet: A Complex Spectral Mapping Approach To Target Speaker Extraction With Cross Attention Speaker Embedding Fusion
2024 · Chang Sun, Bo Qin
Abstract
Target speaker extraction (TSE) is a technique for isolating a target speaker's voice from mixed speech using auxiliary features associated with the target speaker. It is another attempt at addressing the cocktail party problem and is generally considered to have more practical application prospects than traditional speech separation methods. Although academic research in this area has achieved high performance and evaluation scores on public datasets, most models exhibit significantly reduced performance in real-world noisy or reverberant conditions. To address this limitation, we propose a novel TSE model, X-CrossNet, which leverages CrossNet as its backbone. CrossNet is a speech separation network specifically optimized for challenging noisy and reverberant environments, achieving state-of-the-art performance in tasks such as speaker separation under these conditions. Additionally, to enhance the network's ability to capture and utilize auxiliary features of the target speaker, we i
Related papers
- CrossNet: Leveraging Global, Cross-band, Narrow-band, And Positional Encoding For Single- And Multi-channel Speaker Separation (2024) · 0.00
- X-TasNet: Robust And Accurate Time-domain Speaker Extraction Network (2020) · 10.48
- Focus On The Sound Around You: Monaural Target Speaker Extraction Via Distance And Speaker Information (2023) · 7.81
- X-SepFormer: End-to-end Speaker Extraction Network With Explicit Optimization On Speaker Confusion (2023) · 0.00
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023) · 5.24
- USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction (2024) · 11.88
- SpeakerBeam-SS: Real-time Target Speaker Extraction With Lightweight Conv-TasNet And State Space Modeling (2024) · 7.16
- New Insights On Target Speaker Extraction (2022) · 0.00