What Does A Network Layer Hear? Analyzing Hidden Representations Of End-to-end ASR Through Speech Synthesis
2019 · Chung-Yi Li, Pei-Chieh Yuan, Hung-Yi Lee
Abstract
End-to-end speech recognition systems have achieved results competitive with traditional systems. However, the complex transformations between layers, applied to highly variable acoustic signals, are hard to analyze. In this paper, we present our ASR probing model, which synthesizes speech from the hidden representations of an end-to-end ASR model to examine what information is retained after each layer's computation. Listening to the synthesized speech, we observe gradual removal of speaker variability and noise as the layers go deeper, which aligns with previous studies on how deep networks function in speech recognition. This paper is the first study to analyze an end-to-end speech recognition model by demonstrating what each layer hears. Speaker verification and speech enhancement measurements on the synthesized speech are also conducted to further confirm our observation.
Related papers
- Analyzing Phonetic And Graphemic Representations In End-to-end Automatic Speech Recognition (2019)
- Analyzing Hidden Representations In End-to-end Automatic Speech Recognition Systems (2017)
- Visualizing Automatic Speech Recognition -- Means For A Better Understanding? (2022)
- Deep Residual Neural Networks For Audio Spoofing Detection (2019)
- Tied Hidden Factors In Neural Networks For End-to-end Speaker Recognition (2018)
- A Comparison Of End-to-end Models For Long-form Speech Recognition (2019)
- Representation Selective Self-distillation And Wav2vec 2.0 Feature Exploration For Spoof-aware Speaker Verification (2022)
- An Investigation Of End-to-end Multichannel Speech Recognition For Reverberant And Mismatch Conditions (2019)