Unsupervised Speech Representation Learning Using WaveNet Autoencoders
2019 · Jan Chorowski, Ron J. Weiss, Samy Bengio, et al.
Abstract
We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict ph
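The three latent constraints named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, array shapes, and the linear projection are assumptions made for illustration, and training details (the WaveNet decoder, gradient estimators for the quantizer, loss weighting) are omitted.

```python
import numpy as np

# Hypothetical helpers; names and shapes are illustrative assumptions.
# Frames are rows: h has shape (T, H), latents have shape (T, D).

def bottleneck_dimred(h, W):
    """Dimensionality reduction: project each frame to a narrower latent."""
    return h @ W

def bottleneck_gaussian_vae(mu, log_var, rng):
    """Gaussian VAE: sample z = mu + sigma * eps; also return KL(q || N(0, I)) per frame."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
    return z, kl

def bottleneck_vq(z_e, codebook):
    """VQ-VAE: replace each frame with its nearest codebook vector (squared L2)."""
    # Pairwise squared distances between frames (T, D) and codes (K, D) -> (T, K).
    dist = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    idx = np.argmin(dist, axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 8))          # 5 frames of 8-dim encoder features
W = rng.standard_normal((8, 3))          # projection to a 3-dim latent
z = bottleneck_dimred(h, W)
z_vae, kl = bottleneck_gaussian_vae(z, np.zeros_like(z), rng)
z_q, idx = bottleneck_vq(z, rng.standard_normal((4, 3)))
print(z.shape, z_vae.shape, kl.shape, z_q.shape, idx.shape)
```

In this sketch the first variant limits information only through latent width, the second through a KL penalty on a continuous latent, and the third by restricting each frame to one of K discrete codes.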
Related papers
- Unsupervised Acoustic Unit Representation Learning For Voice Conversion Using WaveNet Auto-encoders (2020)
- Learning Latent Representations For Speech Generation And Transformation (2017)
- Learning And Controlling The Source-filter Representation Of Speech With A Variational Autoencoder (2022)
- An Unsupervised Autoregressive Model For Speech Representation Learning (2019)
- wav2vec: Unsupervised Pre-training For Speech Recognition (2019)
- Robust Training Of Vector Quantized Bottleneck Models (2020)
- wav2vec 2.0: A Framework For Self-supervised Learning Of Speech Representations (2020)
- vq-wav2vec: Self-supervised Learning Of Discrete Speech Representations (2019)