Disentangled Speech Representation Learning Based On Factorized Hierarchical Variational Autoencoder With Self-supervised Objective
2022 · Yuying Xie, Thomas Arildsen, Zheng-Hua Tan
Abstract
Disentangled representation learning aims to extract explanatory features or factors and retain salient information. Factorized hierarchical variational autoencoder (FHVAE) presents a way to disentangle a speech signal into sequential-level and segmental-level features, which represent speaker identity and speech content information, respectively. As a self-supervised objective, autoregressive predictive coding (APC), on the other hand, has been used in extracting meaningful and transferable speech features for multiple downstream tasks. Inspired by the success of these two representation learning methods, this paper proposes to integrate the APC objective into the FHVAE framework aiming at benefiting from the additional self-supervision target. The main proposed method requires neither more training data nor more computational cost at test time, but obtains improved meaningful representations while maintaining disentanglement. The experiments were conducted on the TIMIT dataset. Resul
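As an illustration of the self-supervised target mentioned in the abstract, the following is a minimal sketch of an APC-style loss: the prediction made at frame t is compared with the true frame `shift` steps ahead using an L1 distance. The function names, the averaging over frames, and the weighted sum in `combined_loss` are assumptions made for this example; the abstract does not state how the APC term is combined with the FHVAE objective.

```python
def apc_l1_loss(frames, predictions, shift=1):
    """Mean L1 distance between predictions[t] and frames[t + shift].

    frames, predictions: lists of equal length, each item a list of floats
    (one feature vector per time step). Averaging instead of summing is an
    assumption made here to keep the value independent of sequence length.
    """
    if len(frames) != len(predictions):
        raise ValueError("frames and predictions must have the same length")
    if not 1 <= shift < len(frames):
        raise ValueError("shift must satisfy 1 <= shift < number of frames")

    total, count = 0.0, 0
    # Only the first T - shift predictions have a future frame to compare with.
    for t in range(len(frames) - shift):
        for predicted, target in zip(predictions[t], frames[t + shift]):
            total += abs(predicted - target)
            count += 1
    return total / count


def combined_loss(fhvae_loss, apc_loss, apc_weight=1.0):
    """Hypothetical weighted sum of the two objectives (not from the paper)."""
    return fhvae_loss + apc_weight * apc_loss


if __name__ == "__main__":
    frames = [[0.0], [1.0], [2.0], [3.0]]
    perfect = [[1.0], [2.0], [3.0], [0.0]]  # last prediction is never scored
    print(apc_l1_loss(frames, perfect, shift=1))
    print(combined_loss(2.0, apc_l1_loss(frames, frames, shift=1), apc_weight=0.5))
```

Because the extra term only affects training, a setup like this is consistent with the abstract's claim that no additional computation is needed at test time.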
Related papers
- Improved Disentangled Speech Representations Using Contrastive Learning In Factorized Hierarchical Variational Autoencoder (2022)
- Unsupervised Representation Learning Of Speech For Dialect Identification (2018)
- Self-supervised Neural Factor Analysis For Disentangling Utterance-level Speech Representations (2023)
- Adversarially Learning Disentangled Speech Representations For Robust Multi-factor Voice Conversion (2021)
- Mixture Factorized Auto-encoder For Unsupervised Hierarchical Deep Factorization Of Speech Signal (2019)
- Contentvec: An Improved Self-supervised Speech Representation By Disentangling Speakers (2022)
- Weak-supervised Dysarthria-invariant Features For Spoken Language Understanding Using An FHVAE And Adversarial Training (2022)
- Scalable Factorized Hierarchical Variational Autoencoder Training (2018)