Learning Speaker Representations With Mutual Information
2018 · Mirco Ravanelli, Yoshua Bengio
Abstract
Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high dimensional spaces, some recent studies have shown that an implicit optimization of MI can be achieved with an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs). In this work, we learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence. The proposed encoder relies on the SincNet architecture and transforms raw speech waveform into a compact feature vector. The discriminator is fed by either positive samples (of the joint distribution of encoded chunks) or negative samples (from the product of the mar
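The sampling scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names (`sample_chunks`, `toy_encoder`, `toy_discriminator`, `bce_loss`) are hypothetical, and a single `tanh` projection stands in for the SincNet encoder. The binary cross-entropy objective is an assumption, chosen as one common way to train a discriminator to separate joint samples from product-of-marginals samples.

```python
import numpy as np

def sample_chunks(signals, chunk_len, rng):
    """Draw anchor, positive, and negative chunks (needs at least 2 sentences)."""
    n = len(signals)

    def crop(x):
        # Random fixed-length window from one sentence.
        start = rng.integers(0, len(x) - chunk_len + 1)
        return x[start:start + chunk_len]

    anchors = np.stack([crop(x) for x in signals])
    # Positive: a second random chunk from the SAME sentence (joint distribution).
    positives = np.stack([crop(x) for x in signals])
    # Negative: a chunk from a DIFFERENT sentence (product of marginals).
    # A cyclic shift in [1, n-1] guarantees a different sentence index.
    shift = rng.integers(1, n)
    negatives = np.stack([crop(signals[(i + shift) % n]) for i in range(n)])
    return anchors, positives, negatives

def toy_encoder(chunks, W):
    # Stand-in for the SincNet encoder: raw waveform -> compact feature vector.
    return np.tanh(chunks @ W)

def toy_discriminator(z1, z2, v):
    # Scores a pair of embeddings; output is P(pair comes from the same sentence).
    logits = np.concatenate([z1, z2], axis=1) @ v
    return 1.0 / (1.0 + np.exp(-logits))

def bce_loss(p_pos, p_neg, eps=1e-7):
    # Assumed objective: positives are pushed toward 1, negatives toward 0.
    p_pos = np.clip(p_pos, eps, 1.0 - eps)
    p_neg = np.clip(p_neg, eps, 1.0 - eps)
    return -(np.mean(np.log(p_pos)) + np.mean(np.log(1.0 - p_neg)))

# Toy usage with random "waveforms" (4 sentences of 16000 samples each).
rng = np.random.default_rng(0)
signals = [rng.standard_normal(16000) for _ in range(4)]
a, p, ng = sample_chunks(signals, chunk_len=3200, rng=rng)
W = rng.standard_normal((3200, 8)) * 0.01   # hypothetical encoder weights
v = rng.standard_normal(16)                 # hypothetical discriminator weights
za, zp, zn = toy_encoder(a, W), toy_encoder(p, W), toy_encoder(ng, W)
loss = bce_loss(toy_discriminator(za, zp, v), toy_discriminator(za, zn, v))
```

In a real system both networks would be trained jointly by gradient descent on this loss, so that the encoder is pushed to keep whatever information lets the discriminator tell same-sentence pairs apart, which the abstract identifies with speaker identity.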
Related papers
- Revisiting Self-supervised Learning Of Speech Representation From A Mutual Information Perspective (2024) 4.52
- Disentangled Speaker Representation Learning Via Mutual Information Minimization (2022) 5.24
- Mirnet: Learning Multiple Identities Representations In Overlapped Speech (2020) 5.84
- Label-efficient Self-supervised Speaker Verification With Information Maximization And Contrastive Learning (2022) 6.77
- Improving Speech Emotion Recognition With Mutual Information Regularized Generative Model (2025) 0.00
- VQMIVC: Vector Quantization And Mutual Information-based Unsupervised Speech Representation Disentanglement For One-shot Voice Conversion (2021) 20.31
- Learning Modality-invariant Representations For Speech And Images (2017) 8.09
- Learning Problem-agnostic Speech Representations From Multiple Self-supervised Tasks (2019) 15.54