Visualizing Automatic Speech Recognition -- Means For A Better Understanding?
2022 · Karla Markert, Romain Parracone, Mykhailo Kulakov, et al.
Abstract
Automatic speech recognition (ASR) keeps getting better at mimicking human speech processing. How ASR works, however, remains largely obscured by the complex structure of the deep neural networks (DNNs) it is based on. In this paper, we show how so-called attribution methods, which we import from image recognition and adapt to handle audio data, can help to clarify the workings of ASR. Taking DeepSpeech, an end-to-end model for ASR, as a case study, we show how these techniques help to visualize which features of the input are most influential in determining the output. We focus on three visualization techniques: Layer-wise Relevance Propagation (LRP), Saliency Maps, and Shapley Additive Explanations (SHAP). We compare these methods and discuss potential further applications, such as the detection of adversarial examples.
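To make the idea of an attribution method concrete, here is a minimal sketch of a saliency map: the absolute gradient of one output score with respect to each input bin. It is not the paper's implementation and does not use DeepSpeech. The two-layer tanh network, the random weights, and every name in the code (`saliency_map`, `logit`, `spec`, `w1`, `b1`, `w2`) are assumptions chosen only so that the gradient can be written out by hand.

```python
import numpy as np

def logit(spectrogram, w1, b1, w2, target):
    """Score of the target class for a toy two-layer network (hypothetical model)."""
    hidden = np.tanh(w1 @ spectrogram.ravel() + b1)
    return float(w2[target] @ hidden)

def saliency_map(spectrogram, w1, b1, w2, target):
    """Absolute gradient of the target logit with respect to each input bin."""
    hidden = np.tanh(w1 @ spectrogram.ravel() + b1)
    # Chain rule: d logit / d x = W1^T (W2[target] * (1 - tanh^2))
    grad = w1.T @ (w2[target] * (1.0 - hidden ** 2))
    return np.abs(grad).reshape(spectrogram.shape)

# Random stand-ins for a spectrogram (frames x frequency bins) and model weights.
rng = np.random.default_rng(0)
n_frames, n_bins, n_hidden, n_classes = 5, 8, 16, 4
spec = rng.normal(size=(n_frames, n_bins))
w1 = rng.normal(scale=0.1, size=(n_hidden, n_frames * n_bins))
b1 = np.zeros(n_hidden)
w2 = rng.normal(size=(n_classes, n_hidden))

sal = saliency_map(spec, w1, b1, w2, target=2)

# The bin with the largest saliency is the one the score is most sensitive to.
frame, freq_bin = np.unravel_index(np.argmax(sal), sal.shape)
print(f"most influential bin: frame {frame}, frequency bin {freq_bin}")
```

For a real ASR model the gradient would come from an automatic differentiation framework rather than a hand-derived formula; the resulting map has the same shape as the input and can be plotted over the spectrogram.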
Related papers
- Analyzing Phonetic And Graphemic Representations In End-to-end Automatic Speech Recognition (2019)
- Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey (2024)
- Gradient-adjusted Neuron Activation Profiles For Comprehensive Introspection Of Convolutional Speech Recognition Models (2020)
- Unsupervised Automatic Speech Recognition: A Review (2021)
- ASR-based Features For Emotion Recognition: A Transfer Learning Approach (2018)
- Audio-attention Discriminative Language Model For ASR Rescoring (2019)
- What Does A Network Layer Hear? Analyzing Hidden Representations Of End-to-end ASR Through Speech Synthesis (2019)
- AV Taris: Online Audio-visual Speech Recognition (2020)