Learning And Controlling The Source-filter Representation Of Speech With A Variational Autoencoder

Abstract

Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency \(f_0\) and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding \(f_0\) and the first three formant frequencies, we show that thes
