Typing To Listen At The Cocktail Party: Text-guided Target Speaker Extraction
2023 · Xiang Hao, Jibin Wu, Jianwei Yu, et al.
Abstract
Humans can easily isolate a single speaker from a complex acoustic environment, a capability referred to as the "Cocktail Party Effect." However, replicating this ability has been a significant challenge in the field of target speaker extraction (TSE). Traditional TSE approaches predominantly rely on voiceprints, which raise privacy concerns and face issues related to the quality and availability of enrollment samples, as well as intra-speaker variability. To address these issues, this work introduces a novel text-guided TSE paradigm named LLM-TSE. In this paradigm, a state-of-the-art large language model, LLaMA 2, processes typed text input from users to extract semantic cues. We demonstrate that textual descriptions alone can effectively serve as cues for extraction, thus addressing privacy concerns and reducing dependency on voiceprints. Furthermore, our approach offers flexibility by allowing the user to specify the extraction or suppression of a speaker and enhances robustness aga
Related papers
- Contextual Speech Extraction: Leveraging Textual History As An Implicit Cue For Target Speech Extraction (2025) 2.26
- Continuous Target Speech Extraction: Enhancing Personalized Diarization And Extraction On Complex Recordings (2024) 3.58
- TSELM: Target Speaker Extraction Using Discrete Tokens And Language Models (2024) 0.00
- Selective Listening By Synchronizing Speech With Lips (2021) 11.85
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023) 5.24
- Lightweight Speech Enhancement Guided Target Speech Extraction In Noisy Multi-speaker Scenarios (2025) 0.00
- Language-queried Target Sound Extraction Without Parallel Training Data (2024) 5.24
- Focus On The Sound Around You: Monaural Target Speaker Extraction Via Distance And Speaker Information (2023) 7.81