Analyzing Hidden Representations In End-to-end Automatic Speech Recognition Systems
2017 · Yonatan Belinkov, James Glass
Abstract
Neural models have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of
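The layer-wise evaluation described in the abstract (frame-level features from each layer fed to a phone classifier, with accuracy compared across layers) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the features are synthetic stand-ins rather than activations from a real CTC model, the layer names `layer_a` and `layer_b` are hypothetical, and a plain linear softmax classifier is used in place of whatever classifier the authors trained.

```python
import numpy as np


def train_softmax_probe(X, y, num_classes, lr=0.1, epochs=300):
    """Fit a linear softmax classifier on frame features X (frames x dims)."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot phone labels
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def probe_accuracy(W, b, X, y):
    """Fraction of frames whose predicted phone matches the label."""
    return float(np.mean(np.argmax(X @ W + b, axis=1) == y))


def make_fake_layer(rng, labels, num_classes, dim, noise):
    """Synthetic stand-in for one layer's activations: class centroid + noise."""
    centroids = rng.normal(size=(num_classes, dim))
    return centroids[labels] + noise * rng.normal(size=(labels.size, dim))


rng = np.random.default_rng(0)
num_phones, dim = 5, 16  # toy sizes, not the paper's phone inventory
labels = rng.integers(0, num_phones, size=2000)  # one phone label per frame
train, test = slice(0, 1500), slice(1500, 2000)

# Hypothetical layers: one where phones are well separated, one where they are not.
layers = {
    "layer_a": make_fake_layer(rng, labels, num_phones, dim, noise=0.5),
    "layer_b": make_fake_layer(rng, labels, num_phones, dim, noise=5.0),
}

results = {}
for name, feats in layers.items():
    # Standardize with training-set statistics only.
    mean, std = feats[train].mean(axis=0), feats[train].std(axis=0)
    feats = (feats - mean) / std
    W, b = train_softmax_probe(feats[train], labels[train], num_phones)
    results[name] = probe_accuracy(W, b, feats[test], labels[test])

print(results)
```

In this setup, a higher held-out accuracy for one layer is read as that layer encoding phone identity more accessibly. With a real model, the `layers` dictionary would instead hold activations extracted from each layer of the pre-trained network, aligned to frame-level phone labels.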
Related papers
- Analyzing Phonetic And Graphemic Representations In End-to-end Automatic Speech Recognition (2019) 9.23
- Residual Convolutional CTC Networks For Automatic Speech Recognition (2017) 0.00
- What Does A Network Layer Hear? Analyzing Hidden Representations Of End-to-end ASR Through Speech Synthesis (2019) 9.41
- Exploring Neural Transducers For End-to-end Speech Recognition (2017) 14.90
- End-to-end Neural Segmental Models For Speech Recognition (2017) 9.23
- Advances In All-neural Speech Recognition (2016) 11.29
- A Review Of On-device Fully Neural End-to-end Automatic Speech Recognition Algorithms (2020) 9.92
- Streaming End-to-end Speech Recognition For Mobile Devices (2018) 18.87