How To Teach DNNs To Pay Attention To The Visual Modality In Speech Recognition
2020 · George Sterpu, Christian Saam, Naomi Harte
Abstract
Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit, the dynamic relationship between a human voice and the corresponding mouth movements. A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence to sequence neural networks, attempts to model this relationship by explicitly aligning the acoustic and visual representations of speech. This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns. Our experiments are performed on two of the largest publicly available AVSR datasets, TCD-TIMIT and LRS2. We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern. We also determine the cause of initially seeing no improvement over audio-only speech recognition on the more challenging LRS2. We propose a regularisation method which involves predicting lip-related Action Units from visual representations.
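The idea of explicitly aligning acoustic and visual representations can be illustrated with a small cross-modal attention sketch. This is not the authors' AV Align implementation: it assumes plain scaled dot-product attention over already-encoded frames, and the function name, feature size, and frame counts are all hypothetical choices made for illustration.

```python
import numpy as np

def cross_modal_attention(audio, video):
    """Attend from each audio frame over all video frames.

    audio: (T_a, d) acoustic encoder states, used as queries (assumed shape)
    video: (T_v, d) visual encoder states, used as keys and values (assumed shape)
    Returns (weights, context): weights is (T_a, T_v), context is (T_a, d).
    """
    d = audio.shape[1]
    # Similarity between every audio frame and every video frame.
    scores = audio @ video.T / np.sqrt(d)
    # Numerically stable softmax over the video axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Visual context vector for each audio frame.
    context = weights @ video
    return weights, context

# Hypothetical sizes: audio is typically sampled at a higher frame rate than video.
rng = np.random.default_rng(0)
audio = rng.standard_normal((12, 8))
video = rng.standard_normal((4, 8))

weights, context = cross_modal_attention(audio, video)
# One simple way to fuse: concatenate each audio state with its visual context.
fused = np.concatenate([audio, context], axis=1)
```

Plotting `weights` as a heat map (audio frames by video frames) gives the kind of alignment picture the abstract refers to. A generally monotonic alignment would show up as mass concentrated near the diagonal.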
Related papers
- AlignVSR: Audio-visual Cross-modal Alignment For Visual Speech Recognition (2024)
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018)
- Cross-modal Global Interaction And Local Alignment For Audio-visual Speech Recognition (2023)
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023)
- DCIM-AVSR: Efficient Audio-visual Speech Recognition Via Dual Conformer Interaction Module (2024)
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024)
- Predict-and-update Network: Audio-visual Speech Recognition Inspired By Human Speech Perception (2022)