End-to-end Speech Recognition From The Raw Waveform
2018 · Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, et al.
Abstract
State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second one by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks, on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to the low-pass filter used in these approaches. These modifications consistently improve performances
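To make the idea of a convolutional front-end followed by instance normalization concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The function names, the random filters standing in for learned ones, the 16 kHz sampling rate, the filter width (400), the stride (160), the squared Hann window, and the log compression are all assumptions chosen for illustration. The only element taken from the abstract is the ordering: a convolutional filterbank, a low-pass stage, and a per-utterance instance normalization layer.

```python
import numpy as np

def instance_norm(features, eps=1e-5):
    """Normalize each channel of one utterance to zero mean and unit variance over time.

    features: array of shape (channels, frames) for a single utterance.
    """
    mean = features.mean(axis=1, keepdims=True)
    var = features.var(axis=1, keepdims=True)
    return (features - mean) / np.sqrt(var + eps)

def toy_frontend(waveform, filters, window, stride):
    """Hypothetical front-end: filter, square, low-pass, log-compress, normalize."""
    # Band-pass stage: one 1-D convolution per filter -> (channels, samples)
    filtered = np.stack([np.convolve(waveform, f, mode="valid") for f in filters])
    energy = filtered ** 2
    # Low-pass stage: smooth each channel with `window`, then decimate by `stride`
    smoothed = np.stack(
        [np.convolve(e, window, mode="valid")[::stride] for e in energy]
    )
    # Log compression (an assumption here; log1p keeps the output finite at zero energy)
    compressed = np.log1p(smoothed)
    return instance_norm(compressed)

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)      # 1 s of noise at an assumed 16 kHz
filters = rng.standard_normal((40, 400))   # random placeholders for learned filters
window = np.hanning(400) ** 2              # assumed low-pass window
features = toy_frontend(waveform, filters, window, stride=160)
print(features.shape)
```

In a trainable system the filters (and possibly the low-pass window) would be parameters updated by backpropagation. Here they are fixed so that the data flow and the normalization step can be inspected in isolation.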
Related papers
- Fully Convolutional Speech Recognition (2018) 0.00
- Learning Multiscale Features Directly From Waveforms (2016) 0.00
- Raw Waveform Encoder With Multi-scale Globally Attentive Locally Recurrent Networks For End-to-end Speech Recognition (2021) 0.00
- Wav2letter: An End-to-end Convnet-based Speech Recognition System (2016) 0.00
- Learning Waveform-based Acoustic Models Using Deep Variational Convolutional Neural Networks (2019) 6.77
- Raw Waveform-based Speech Enhancement By Fully Convolutional Networks (2017) 16.63
- Detection Of Doctored Speech: Towards An End-to-end Parametric Learn-able Filter Approach (2022) 0.00
- End-to-end Whisper To Natural Speech Conversion Using Modified Transformer Network (2020) 0.00