Multi-modal Automated Speech Scoring Using Attention Fusion
2020 · Manraj Singh Grover, Yaman Kumar, Sumit Sarin, et al.
Abstract
In this study, we propose a novel multi-modal, end-to-end neural approach that uses attention fusion for automated assessment of non-native English speakers' spontaneous speech. The pipeline employs Bi-directional Recurrent Convolutional Neural Networks and Bi-directional Long Short-Term Memory Neural Networks to encode acoustic and lexical cues from spectrograms and transcriptions, respectively. Attention fusion is then applied to these learned predictive features to capture complex interactions between the modalities before final scoring. We compare our model with strong baselines and find that combined attention to both lexical and acoustic cues significantly improves the overall performance of the system. Further, we present a qualitative and quantitative analysis of our model.
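The abstract does not give the fusion equations, so the following is only a minimal sketch of one common form of attention fusion, not the authors' implementation. It assumes each modality encoder has already produced a fixed-size feature vector, scores each vector with additive attention, and takes the softmax-weighted sum before a linear scoring head. The names `attention_fuse`, `W`, `v`, and `w_out`, the dimension `d = 8`, and the random stand-in features are all hypothetical.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, W, v):
    """Fuse per-modality feature vectors with additive attention.

    features: list of modality vectors, each of length d
    W: d x d projection matrix, v: length-d scoring vector
       (both are hypothetical learned parameters)
    Returns (fused vector of length d, one attention weight per modality).
    """
    scores = []
    for h in features:
        # Project h with W, squash with tanh, then score against v.
        proj = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)))
                for row in W]
        scores.append(sum(p * v_i for p, v_i in zip(proj, v)))
    weights = softmax(scores)  # weights sum to 1 across modalities
    d = len(features[0])
    # Weighted sum of the modality vectors, one dimension at a time.
    fused = [sum(w * h[i] for w, h in zip(weights, features))
             for i in range(d)]
    return fused, weights


rng = random.Random(0)
d = 8  # assumed feature size, chosen only for illustration

# Random stand-ins for the acoustic and lexical encoder outputs.
acoustic = [rng.gauss(0, 1) for _ in range(d)]
lexical = [rng.gauss(0, 1) for _ in range(d)]

# Randomly initialised parameters; in a real model these would be trained.
W = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
v = [rng.gauss(0, 1) for _ in range(d)]
w_out = [rng.gauss(0, 1) for _ in range(d)]  # hypothetical scoring head

fused, weights = attention_fuse([acoustic, lexical], W, v)
score = sum(f * w for f, w in zip(fused, w_out))
```

Because the attention weights form a convex combination, each dimension of `fused` lies between the corresponding acoustic and lexical values. In a trained model the weights would indicate how much each modality contributes to a given response's score.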
Related papers
- AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021) 0.00
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020) 0.00
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) 16.67
- Speech Emotion Recognition Using Multi-hop Attention Mechanism (2019) 14.58
- Automatic Quality Assessment For Audio-visual Verification Systems. The Love Submission To NIST SRE Challenge 2019 (2020) 0.00
- A Novel Multimodal Dynamic Fusion Network For Disfluency Detection In Spoken Utterances (2022) 0.00
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022) 18.00
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) 0.00