Disentangling Voice And Content With Self-supervision For Speaker Recognition
2023 · Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, et al.
Abstract
For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with the use of three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of labels other than speaker identities. The efficacy of the proposed framework is validated via experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor data is specifically needed, it is easily applicable in practical use.
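The abstract reports its gains as relative reductions in EER (equal error rate), the operating point at which the false-acceptance and false-rejection rates coincide. As a rough illustration only, the sketch below shows one common way to approximate EER from verification scores and to express a relative reduction. It is not the paper's evaluation code; the function names and the toy scores are assumptions made up for this example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by trying every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, proposed):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Toy scores (hypothetical, not taken from VoxCeleb or SITW).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
print(relative_reduction(2.0, 1.5))
```

With a real trial list the threshold sweep is usually done on a sorted score array or a ROC curve rather than with this quadratic loop; the version above favors readability over speed.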
Related papers
- ContentVec: An Improved Self-supervised Speech Representation By Disentangling Speakers (2022)
- Intra-class Variation Reduction Of Speaker Representation In Disentanglement Framework (2020)
- Disentangled Representation Learning For Environment-agnostic Speaker Recognition (2024)
- Self-supervised Disentangled Representation Learning For Robust Target Speech Extraction (2023)
- Speaker And Style Disentanglement Of Speech Based On Contrastive Predictive Coding Supported Factorized Variational Autoencoder (2024)
- Speaker Disentanglement Of Speech Pre-trained Model Based On Interpretability (2025)
- Unsupervised Learning Of Disentangled Speech Content And Style Representation (2020)
- Disentangled Speaker And Nuisance Attribute Embedding For Robust Speaker Verification (2020)