Quantitative Evidence On Overlooked Aspects Of Enrollment Speaker Embeddings For Target Speaker Separation
2022 · Xiaoyu Liu, Xu Li, Joan Serrà
Abstract
Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms
Related papers
- Investigation Of Speaker Representation For Target-speaker Speech Processing (2024)
- Target Speaker Extraction By Directly Exploiting Contextual Information In The Time-frequency Domain (2024)
- Target Confusion In End-to-end Speaker Extraction: Analysis And Approaches (2022)
- Adapting Self-supervised Models To Multi-talker Speech Recognition Using Speaker Embeddings (2022)
- An Analysis On The Effects Of Speaker Embedding Choice In Non Auto-regressive TTS (2023)
- New Insights On Target Speaker Extraction (2022)
- Supervised Speaker Embedding De-mixing In Two-speaker Environment (2020)
- USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction (2024)