TSELM: Target Speaker Extraction Using Discrete Tokens And Language Models
2024 · Beilong Tang, Bang Zeng, Ming Li
Abstract
We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
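The step from regression to classification can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes the model emits one vector of unnormalized scores (logits) per frame over a codebook of K discrete tokens, and that the training target is the token index of the clean speech at that frame. The function names, the codebook size of 4, and the numbers are hypothetical.

```python
import math


def token_cross_entropy(logits, targets):
    """Mean cross-entropy between per-frame logits and target token ids.

    logits:  list of frames, each a list of K unnormalized scores
             (K = assumed codebook size).
    targets: list of integer token ids, one per frame.
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must cover the same frames")
    total = 0.0
    for frame, target in zip(logits, targets):
        # Numerically stable log-sum-exp gives the log normalizer.
        peak = max(frame)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in frame))
        # Negative log-probability of the correct token.
        total += log_z - frame[target]
    return total / len(targets)


def predict_tokens(logits):
    """Pick the highest-scoring token per frame (greedy decoding)."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in logits]


# Toy data: 3 frames, hypothetical codebook of 4 tokens.
logits = [
    [2.0, 0.5, -1.0, 0.0],
    [0.1, 0.2, 3.0, -0.5],
    [0.0, 0.0, 0.0, 0.0],
]
targets = [0, 2, 1]

loss = token_cross_entropy(logits, targets)
predicted = predict_tokens(logits)
```

In a full system the predicted token ids would be passed to a vocoder (the abstract names a scalable HiFi-GAN) to reconstruct the waveform; that stage is outside this sketch.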
Related papers
- SELM: Speech Enhancement Using Discrete Tokens And Language Models (2023) 11.19
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023) 5.24
- Typing To Listen At The Cocktail Party: Text-guided Target Speaker Extraction (2023) 3.58
- High Fidelity Text-to-speech Via Discrete Tokens Using Token Transducer And Group Masked Language Model (2024) 4.52
- SpeakerBeam-SS: Real-time Target Speaker Extraction With Lightweight Conv-TasNet And State Space Modeling (2024) 7.16
- Lightweight Speech Enhancement Guided Target Speech Extraction In Noisy Multi-speaker Scenarios (2025) 0.00
- SoloAudio: Target Sound Extraction With Language-oriented Audio Diffusion Transformer (2024) 7.50
- Target Speech Extraction With Pre-trained Self-supervised Learning Models (2024) 9.41