Towards The Next Frontier In Speech Representation Learning Using Disentanglement
2024 · Varun Krishna, Sriram Ganapathy
Abstract
The popular frameworks for self-supervised learning of speech representations have largely focused on frame-level masked prediction of speech regions. While this has shown promising downstream performance for speech recognition and related tasks, it has largely ignored factors of speech that are encoded at a coarser level, such as characteristics of the speaker or channel that remain consistent throughout a speech utterance. In this work, we propose a framework for Learning Disentangled Self-Supervised representations of speech (termed Learn2Diss), which consists of a frame-level and an utterance-level encoder module. The two encoders are initially learned independently: the frame-level model is largely inspired by existing self-supervision techniques, thereby learning pseudo-phonemic representations, while the utterance-level encoder is inspired by contrastive learning of pooled embeddings, thereby learning pseudo-speaker representations. The joint learning of these two mo
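
The abstract's phrase "contrastive learning of pooled embeddings" can be illustrated with a minimal sketch. The specifics here are assumptions rather than the authors' method: mean pooling over frames, an InfoNCE-style loss, and toy random features standing in for encoder outputs. The names `mean_pool` and `info_nce` are hypothetical helpers, not part of any Learn2Diss code.

```python
import numpy as np

def mean_pool(frames):
    """Collapse frame-level features (T, D) into one utterance-level vector (D,).
    Mean pooling is an assumption; the abstract only says 'pooled embeddings'."""
    return frames.mean(axis=0)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over pooled embeddings of shape (B, D).
    Row i of `positives` is treated as the positive for row i of `anchors`;
    every other row in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                        # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy data (hypothetical): each utterance has a constant offset shared by all
# of its frames, standing in for a speaker or channel factor.
rng = np.random.default_rng(0)
B, T, D = 4, 50, 8
offsets = rng.normal(size=(B, 1, D)) * 3.0
seg_a = offsets + rng.normal(size=(B, T, D))   # first segment of each utterance
seg_b = offsets + rng.normal(size=(B, T, D))   # second segment of the same utterance

emb_a = np.stack([mean_pool(x) for x in seg_a])
emb_b = np.stack([mean_pool(x) for x in seg_b])

loss_matched = info_nce(emb_a, emb_b)                       # positives from the same utterance
loss_shuffled = info_nce(emb_a, np.roll(emb_b, 1, axis=0))  # positives from the wrong utterance
print(loss_matched, loss_shuffled)
```

In this toy setup, pooling averages out the frame-to-frame variation and keeps the utterance-constant factor, so segments of the same utterance should score a lower loss than mismatched pairs. That is the intuition behind calling such embeddings pseudo-speaker representations.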
Related papers
- ContentVec: An Improved Self-supervised Speech Representation By Disentangling Speakers (2022)
- Self-supervised Disentangled Representation Learning For Robust Target Speech Extraction (2023)
- Intra-class Variation Reduction Of Speaker Representation In Disentanglement Framework (2020)
- Disentangled Representation Learning For Environment-agnostic Speaker Recognition (2024)
- Learning Disentangled Speech Representations (2023)
- 3D-Speaker: A Large-scale Multi-device, Multi-distance, And Multi-dialect Corpus For Speech Representation Disentanglement (2023)
- Disentangling Voice And Content With Self-supervision For Speaker Recognition (2023)
- Improving Unsupervised Subword Modeling Via Disentangled Speech Representation Learning And Transformation (2019)