Learning And Controlling The Source-filter Representation Of Speech With A Variational Autoencoder
2022 · Samir Sadok, Simon Leglaive, Laurent Girin, et al.
Abstract
Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency \(f_0\) and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding \(f_0\) and the first three formant frequencies, we show that thes
Related papers
- Learning Latent Representations For Speech Generation And Transformation (2017)
- Deep Encoder-decoder Models For Unsupervised Learning Of Controllable Speech Synthesis (2018)
- Variational Autoencoders For Learning Latent Representations Of Speech Emotion: A Preliminary Study (2017)
- A Statistically Principled And Computationally Efficient Approach To Speech Enhancement Using Variational Autoencoders (2019)
- Learning Robust Speech Representation With An Articulatory-regularized Variational Autoencoder (2021)
- Learning Latent Representations For Style Control And Transfer In End-to-end Speech Synthesis (2018)
- Unsupervised Speech Representation Learning Using Wavenet Autoencoders (2019)
- A Benchmark Of Dynamical Variational Autoencoders Applied To Speech Spectrogram Modeling (2021)