Mixture Factorized Auto-encoder For Unsupervised Hierarchical Deep Factorization Of Speech Signal
2019 · Zhiyuan Peng, Siyuan Feng, Tan Lee
Abstract
A speech signal is shaped by several informative factors, such as linguistic content and speaker characteristics. Notable recent studies have attempted to factorize the speech signal into these individual factors without requiring any annotation. These studies typically assume a continuous representation for linguistic content, which is not in accordance with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models the linguistic content of input speech with a discrete categorical distribution. It performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level vector representation. A frame decoder serves to reconstruct speech features from the encoders' outputs. Th
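The data flow described in the abstract (frame tokenizer, utterance embedder, frame decoder) can be sketched as a toy forward pass. This is only an illustration under stated assumptions, not the authors' implementation: the class name `ToyMFAE`, the single linear maps standing in for the neural networks, the mean pooling in the embedder, the additive decoder, the random untrained weights, and the mean-squared-error measure are all hypothetical choices, since the excerpt does not specify the architecture or the training objective.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class ToyMFAE:
    """Hypothetical stand-in for the mFAE data flow (untrained)."""

    def __init__(self, feat_dim, n_mix, emb_dim, seed=0):
        rng = random.Random(seed)
        self.W_tok = rand_matrix(n_mix, feat_dim, rng)      # frame tokenizer
        self.W_emb = rand_matrix(emb_dim, feat_dim, rng)    # utterance embedder
        self.W_dec_tok = rand_matrix(feat_dim, n_mix, rng)  # decoder, content path
        self.W_dec_emb = rand_matrix(feat_dim, emb_dim, rng)  # decoder, utterance path

    def tokenize(self, frames):
        # One categorical distribution (soft mixture label) per frame.
        return [softmax(matvec(self.W_tok, x)) for x in frames]

    def embed(self, frames):
        # Assumption: mean-pool the frames, then project to one vector.
        n = len(frames)
        mean = [sum(col) / n for col in zip(*frames)]
        return matvec(self.W_emb, mean)

    def decode(self, posteriors, emb):
        # Assumption: each frame is the sum of a content term and an
        # utterance-level term shared by all frames.
        shared = matvec(self.W_dec_emb, emb)
        return [[a + b for a, b in zip(matvec(self.W_dec_tok, q), shared)]
                for q in posteriors]

    def forward(self, frames):
        q = self.tokenize(frames)
        e = self.embed(frames)
        return q, e, self.decode(q, e)

def mse(frames, recon):
    """Mean squared reconstruction error over all frames and dimensions."""
    total = sum((a - b) ** 2
                for x, y in zip(frames, recon) for a, b in zip(x, y))
    return total / (len(frames) * len(frames[0]))

# Usage: 6 random frames of 4-dimensional features, 3 mixtures, 2-dim embedding.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
model = ToyMFAE(feat_dim=4, n_mix=3, emb_dim=2)
q, e, x_hat = model.forward(frames)
print("soft label of first frame:", q[0])
print("reconstruction error:", mse(frames, x_hat))
```

Each row of `q` sums to one, which is what makes it a soft mixture label rather than a hard cluster assignment, while `e` is a single vector for the whole utterance. In this sketch the per-frame content and the utterance-level information reach the decoder through separate paths, which is one simple way to picture the factorization the abstract describes.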
Related papers
- Deep Factorization For Speech Signal (2018)
- Disentangled Speech Representation Learning Based On Factorized Hierarchical Variational Autoencoder With Self-supervised Objective (2022)
- Deep Generative Factorization For Speech Signal (2020)
- Unsupervised Representation Learning Of Speech For Dialect Identification (2018)
- Self-supervised Neural Factor Analysis For Disentangling Utterance-level Speech Representations (2023)
- Improved Disentangled Speech Representations Using Contrastive Learning In Factorized Hierarchical Variational Autoencoder (2022)
- Content-context Factorized Representations For Automated Speech Recognition (2022)
- Scalable Factorized Hierarchical Variational Autoencoder Training (2018)