Removing Speaker Information From Speech Representation Using Variable-length Soft Pooling
2024 · Injune Hwang, Kyogu Lee
Abstract
Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker information in the speech representation. This paper aims to remove speaker information by exploiting the structured nature of speech, which is composed of discrete units such as phonemes with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling that extracts event-based representations instead of fixed-rate ones. The boundary predictor outputs a boundary probability between 0 and 1, which makes the pooling soft. The model is trained to minimize the difference between the pooled representation of the original data and that of the data augmented by time-stretching and pitch-shifting. To confirm that the learned representation includes content information but is independent of speaker information, the model was evaluated with Libri-light's phonetic ABX task
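The abstract does not spell out the pooling formula, so the following is only a minimal sketch of one way soft variable-length pooling could work, not the authors' implementation. It assumes that the cumulative sum of boundary probabilities gives each frame a soft segment position, and that frames are averaged into segments with a triangular weighting around each integer position. The names `soft_pool`, `frames`, and `boundary_prob` are hypothetical.

```python
import numpy as np

def soft_pool(frames, boundary_prob, eps=1e-8):
    """Pool frame-level features into a variable number of soft segments.

    frames:        (T, D) array of frame representations.
    boundary_prob: (T,) array in [0, 1]; probability that a new segment
                   starts at each frame (assumed convention).
    Returns (pooled, weights) with shapes (K, D) and (K, T).
    """
    frames = np.asarray(frames, dtype=float)
    p = np.asarray(boundary_prob, dtype=float)

    # Soft segment position: expected number of boundaries seen so far.
    pos = np.cumsum(p)                               # (T,)
    num_segments = int(np.ceil(pos[-1])) + 1         # K depends on the input
    centers = np.arange(num_segments)[:, None]       # (K, 1)

    # Triangular weighting (an assumption): a frame at position 1.3
    # contributes 0.7 to segment 1 and 0.3 to segment 2.
    weights = np.clip(1.0 - np.abs(pos[None, :] - centers), 0.0, None)

    # Weighted mean of the frames assigned to each segment.
    pooled = weights @ frames / (weights.sum(axis=1, keepdims=True) + eps)
    return pooled, weights

# Hard boundaries (probabilities of exactly 0 or 1) reduce to ordinary
# mean pooling within each segment.
x = np.array([[0.0], [2.0], [3.0], [3.0], [3.0], [10.0]])
b = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])
pooled, weights = soft_pool(x, b)
```

Because every operation is differentiable with respect to `boundary_prob`, the same idea could be trained end to end in an autodiff framework. With soft probabilities, the number of output segments varies with the input, which is what makes the pooling variable-length.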
Related papers
- Recursive Attentive Pooling For Extracting Speaker Embeddings From Multi-speaker Recordings (2024)
- Exploring The Encoding Layer And Loss Function In End-to-end Speaker And Language Recognition System (2018)
- Self-supervised Predictive Coding Models Encode Speaker And Phonetic Information In Orthogonal Subspaces (2023)
- ContentVec: An Improved Self-supervised Speech Representation By Disentangling Speakers (2022)
- Deep Speaker Embedding Learning With Multi-level Pooling For Text-independent Speaker Verification (2019)
- Spatial Pyramid Encoding With Convex Length Normalization For Text-independent Speaker Verification (2019)
- An Unsupervised Autoregressive Model For Speech Representation Learning (2019)
- Intra-class Variation Reduction Of Speaker Representation In Disentanglement Framework (2020)