Improving Curriculum Learning For Target Speaker Extraction With Synthetic Speakers
2024 · Yun Liu, Xuechen Liu, Junichi Yamagishi
Abstract
Target speaker extraction (TSE) aims to isolate individual speaker voices from complex speech environments. The effectiveness of TSE systems is often compromised when the speaker characteristics are similar to each other. Recent research has introduced curriculum learning (CL), in which TSE models are trained incrementally on speech samples of increasing complexity. In CL training, the model is first trained on samples with low speaker similarity between the target and interference speakers, and then on samples with high speaker similarity. To further improve CL, this paper uses a \(k\)-nearest neighbor-based voice conversion method to simulate and generate speech of diverse interference speakers, and then uses the generated data as part of the CL. Experiments demonstrate that training data based on synthetic speakers can effectively enhance the model's capabilities and significantly improve the performance of multiple TSE systems.
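The two ideas in the abstract, staging training mixtures by speaker similarity and generating new interference speakers by nearest-neighbor matching, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: speaker embeddings and frame features are plain lists of floats, similarity is cosine similarity, the two-stage split uses a hypothetical `threshold` of 0.5, and the function names `knn_convert` and `curriculum_stages` are invented here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]  # (target speaker embedding, interferer embedding)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def knn_convert(source: List[Vector], pool: List[Vector], k: int = 2) -> List[List[float]]:
    """Replace each source frame with the mean of its k most similar pool frames.

    `pool` stands in for frame features of a reference speaker (hypothetical input).
    """
    converted = []
    for frame in source:
        # Rank reference frames by similarity to the current source frame.
        nearest = sorted(pool, key=lambda p: cosine(frame, p), reverse=True)[:k]
        # Average the selected frames dimension by dimension.
        converted.append(
            [sum(p[d] for p in nearest) / len(nearest) for d in range(len(frame))]
        )
    return converted


def curriculum_stages(pairs: List[Pair], threshold: float = 0.5) -> Tuple[List[Pair], List[Pair]]:
    """Split mixtures into an easy stage (low similarity) and a hard stage (high similarity).

    Both stages are returned sorted from least to most similar.
    """
    scored = sorted(pairs, key=lambda p: cosine(p[0], p[1]))
    easy = [p for p in scored if cosine(p[0], p[1]) < threshold]
    hard = [p for p in scored if cosine(p[0], p[1]) >= threshold]
    return easy, hard
```

In this sketch, a training loop would consume the `easy` list first and the `hard` list afterwards; mixtures whose interferer was produced by `knn_convert` could be scored and placed in either stage the same way. How the paper actually schedules synthetic data within the curriculum is not stated in the abstract.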
Related papers
- Language-queried Target Sound Extraction Without Parallel Training Data (2024)
- Libri2vox Dataset: Target Speaker Extraction With Diverse Speaker Conditions And Synthetic Data (2024)
- Lightweight Speech Enhancement Guided Target Speech Extraction In Noisy Multi-speaker Scenarios (2025)
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023)
- X-sepformer: End-to-end Speaker Extraction Network With Explicit Optimization On Speaker Confusion (2023)
- Continuous Target Speech Extraction: Enhancing Personalized Diarization And Extraction On Complex Recordings (2024)
- Developing Far-field Speaker System Via Teacher-student Learning (2018)
- Focus On The Sound Around You: Monaural Target Speaker Extraction Via Distance And Speaker Information (2023)