Convolutional Variational Autoencoders For Spectrogram Compression In Automatic Speech Recognition
2024 · Olga Iakovenko, Ivan Bondarenko
Abstract
For many Automatic Speech Recognition (ASR) tasks, spectrograms used as audio features give better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use because of the high dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands, the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.
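The abstract does not spell out the training objective, so the following is only a minimal sketch of the standard VAE formulation (reparameterized sampling plus a closed-form KL term) for a 13-dimensional latent vector. The function names, the squared-error reconstruction term, and the unweighted sum of the two terms are assumptions made for illustration, not details taken from the paper; a real model would compute `mu` and `log_var` with a convolutional encoder.

```python
import math
import random

LATENT_DIM = 13  # embedding size quoted in the abstract for 25 ms fragments


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def negative_elbo(target, reconstruction, mu, log_var):
    """Assumed loss: squared reconstruction error plus the KL regulariser."""
    recon = sum((t - r) ** 2 for t, r in zip(target, reconstruction))
    return recon + kl_to_standard_normal(mu, log_var)


# Hypothetical encoder output for one spectrogram fragment.
rng = random.Random(0)
mu = [0.0] * LATENT_DIM
log_var = [0.0] * LATENT_DIM
z = reparameterize(mu, log_var, rng)  # the compressed feature vector
```

In this reading, the vector `z` (or simply `mu` at inference time) would play the role that the MFCC vector plays in a conventional ASR front end.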