Recurrent Neural Network Transducer For Audio-visual Speech Recognition
2019 Β· Takaki Makino, Hank Liao, Yannis Assael, et al.
Abstract
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of an audio-only, visual-only, and audio-visual system are compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.
Authors
(none)
Tags
Stats
Related papers
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)0.00
- Audio Visual Speech Recognition Using Deep Recurrent Neural Networks (2016)7.81
- Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video (2022)11.39
- Large-scale Visual Speech Recognition (2018)14.43
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024)8.60
- XLAVS-R: Cross-lingual Audio-visual Speech Representation Learning For Noise-robust Speech Perception (2024)7.50
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With Rnn-transducer (2018)16.21
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020)0.00