Acoustic-to-word Recognition With Sequence-to-sequence Models
2018 · Shruti Palaskar, Florian Metze
Abstract
Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon. While character-based models offer a natural solution to the out-of-vocabulary problem, word models can be simpler to decode and may also be able to directly recognize semantically meaningful units. We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work. In addition to these promising results, word-based models are more interpretable than character models, which have to be composed into words using a separate decoding step. We analyze the encoder hidden states and the attention behavior, and show that location-aware attention naturally represents words as a single speech-word-vector, despite spanning multiple frames in the input.
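The Word Error Rate cited above is conventionally computed as the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the recognizer output, divided by the number of reference words. An "absolute" improvement is then a difference in percentage points between two such rates. The following is a minimal sketch of that standard metric, not the authors' scoring pipeline; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared verbatim
    (no text normalization, which real scoring tools usually apply).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, chosen only to exercise the function.
    print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, the value can exceed 1.0 when the hypothesis contains many insertions.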
Related papers
- Towards Better Decoding And Language Model Integration In Sequence To Sequence Models (2016)
- On Using 2D Sequence-to-sequence Models For Speech Recognition (2019)
- Single Headed Attention Based Sequence-to-sequence Model For State-of-the-art Results On Switchboard (2020)
- State-of-the-art Speech Recognition With Sequence-to-sequence Models (2017)
- Word-level Speech Recognition With A Letter To Word Encoder (2019)
- Phoneme Based Neural Transducer For Large Vocabulary Speech Recognition (2020)
- Acoustic-to-word Model Without OOV (2017)
- Instant One-shot Word-learning For Context-specific Neural Sequence-to-sequence Speech Recognition (2021)