Analyzing Learned Representations Of A Deep ASR Performance Prediction Model
2018 · Zied Elloumi, Laurent Besacier, Olivier Galibert, et al.
Abstract
This paper addresses a relatively new task: predicting ASR performance on unseen broadcast programs. In a previous paper, we presented an ASR performance prediction system that uses CNNs to encode both text (the ASR transcript) and speech in order to predict word error rate. This work analyzes the speech signal embeddings and text embeddings learnt by the CNN while training our prediction model. We try to better understand which information the deep model captures and how it relates to different conditioning factors. We show that hidden layers convey a clear signal about speech style, accent and broadcast type. We then try to leverage these three types of information at training time through multi-task learning. Our experiments show that this makes it possible to train slightly more efficient ASR performance prediction systems that, in addition, simultaneously tag the analyzed utterances according to their speech style, accent and broadcast program origin.
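The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to set up such a multi-task loss: a regression term on word error rate plus weighted cross-entropy terms for the three auxiliary tags. The function names, the absolute-error regression term, the `aux_weight` value and the number of classes per task are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


def multitask_loss(pred_wer, true_wer, aux_logits, aux_targets, aux_weight=0.1):
    """Hypothetical combined loss: WER regression plus auxiliary tagging.

    aux_logits  maps a task name to the class scores for one utterance.
    aux_targets maps the same task name to the index of the correct class.
    aux_weight  is an assumed trade-off factor, not a value from the paper.
    """
    # Main task: absolute error between predicted and true WER (assumed choice).
    main = abs(pred_wer - true_wer)
    # Auxiliary tasks: sum of per-task classification losses.
    aux = sum(cross_entropy(aux_logits[task], aux_targets[task])
              for task in aux_targets)
    return main + aux_weight * aux


# Example with the three factors named in the abstract; class counts are made up.
loss = multitask_loss(
    pred_wer=0.30,
    true_wer=0.20,
    aux_logits={"style": [1.2, -0.3], "accent": [0.1, 0.4], "broadcast": [0.5, 0.2, -1.0]},
    aux_targets={"style": 0, "accent": 1, "broadcast": 0},
)
print(f"combined loss: {loss:.4f}")
```

In a setup like this, the auxiliary terms share the hidden layers with the WER predictor, which is one plausible way the tagging tasks could shape the learnt embeddings.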
Related papers
- ASR Performance Prediction On Unseen Broadcast Programs Using Convolutional Neural Networks (2018)
- Analyzing Deep Cnn-based Utterance Embeddings For Acoustic Model Adaptation (2018)
- CTA-RNN: Channel And Temporal-wise Attention RNN Leveraging Pre-trained ASR Embeddings For Speech Emotion Recognition (2022)
- Analyzing Hidden Representations In End-to-end Automatic Speech Recognition Systems (2017)
- ASAPP-ASR: Multistream CNN And Self-attentive SRU For SOTA Speech Recognition (2020)
- Visualizing Automatic Speech Recognition -- Means For A Better Understanding? (2022)
- Content-context Factorized Representations For Automated Speech Recognition (2022)
- An Empirical Analysis Of Deep Audio-visual Models For Speech Recognition (2018)