Learning Waveform-based Acoustic Models Using Deep Variational Convolutional Neural Networks
2019 Β· Dino Oglic, Zoran Cvetkovic, Peter Sollich
Abstract
We investigate the potential of stochastic neural networks for learning effective waveform-based acoustic models. The waveform-based setting, inherent to fully end-to-end speech recognition systems, is motivated by several comparative studies of automatic and human speech recognition that associate standard non-adaptive feature extraction techniques with information loss which can adversely affect robustness. Stochastic neural networks, on the other hand, are a class of models capable of incorporating rich regularization mechanisms into the learning process. We consider a deep convolutional neural network that first decomposes speech into frequency sub-bands via an adaptive parametric convolutional block where filters are specified by cosine modulations of compactly supported windows. The network then employs standard non-parametric 1D convolutions to extract relevant spectro-temporal patterns while gradually compressing the structured high dimensional representation generated by the p
Authors
(none)
Tags
Stats
Related papers
- Wasserstein GAN And Waveform Loss-based Acoustic Model Training For Multi-speaker Text-to-speech Synthesis Systems Using A Wavenet Vocoder (2018)12.61
- A Comparison Of Recent Waveform Generation And Acoustic Modeling Methods For Neural-network-based Speech Synthesis (2018)11.76
- Learning Multiscale Features Directly From Waveforms (2016)0.00
- High-fidelity And Low-latency Universal Neural Vocoder Based On Multiband Wavernn With Data-driven Linear Prediction For Discrete Waveform Modeling (2021)6.77
- Generative Adversarial Network-based Glottal Waveform Model For Statistical Parametric Speech Synthesis (2019)10.35
- Raw Waveform-based Speech Enhancement By Fully Convolutional Networks (2017)16.63
- Lvcnet: Efficient Condition-dependent Modeling Network For Waveform Generation (2021)8.09
- Rawnet: Advanced End-to-end Deep Neural Network Using Raw Waveforms For Text-independent Speaker Verification (2019)15.34