Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video
2022 · Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
Abstract
Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g. a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) have demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 31.4% WER on YTDEV18 and 17.0% on LRS3-TED, relative improvements of 10% and 15%, respectively, over our convolutional baseline. We achieve the st
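The relative improvements quoted in the abstract follow the usual word-error-rate convention, where lower is better and the gain is measured as a fraction of the baseline WER. The sketch below only illustrates that arithmetic. The function names are hypothetical, and the baseline values it prints are back-computed from the rounded figures in the abstract (roughly 34.9% on YTDEV18 and 20.0% on LRS3-TED), so they are approximations and not numbers reported by the authors.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (lower WER is better)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, rel_improvement: float) -> float:
    """Baseline WER implied by a new WER and a relative improvement in [0, 1)."""
    if not 0 <= rel_improvement < 1:
        raise ValueError("rel_improvement must be in [0, 1)")
    return new_wer / (1.0 - rel_improvement)


if __name__ == "__main__":
    # WER and improvement figures are taken from the abstract; the implied
    # baselines are derived here and are NOT values stated in the source.
    for corpus, wer, rel in [("YTDEV18", 31.4, 0.10), ("LRS3-TED", 17.0, 0.15)]:
        base = implied_baseline(wer, rel)
        print(f"{corpus}: WER {wer:.1f}%, implied baseline ~{base:.1f}%")
```

Because the abstract rounds both the WERs and the percentages, the implied baselines may differ from the baseline numbers in the full paper by a few tenths of a point.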
Related papers
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024)
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024)
- Avformer: Injecting Vision Into Frozen Speech Models For Zero-shot AV-ASR (2023)
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)
- End-to-end Multi-talker Audio-visual ASR Using An Active Speaker Attention Module (2022)
- Recurrent Neural Network Transducer For Audio-visual Speech Recognition (2019)
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020)
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022)