AV Taris: Online Audio-visual Speech Recognition
2020 · George Sterpu, Naomi Harte
Abstract
In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions. In more demanding situations involving distant microphones, overlapped speech, background noise, or natural dialogue structures, the ASR error rate is at least an order of magnitude higher. The visual modality of speech carries the potential to partially overcome these challenges and contribute to the sub-tasks of speaker diarisation, voice activity detection, and the recovery of the place of articulation, and can compensate for up to 15 dB of noise on average. This article develops AV Taris, a fully differentiable neural network model capable of decoding audio-visual speech in real time. We achieve this by connecting two recently proposed models for audio-visual speech integration and online speech recognition, namely AV Align and Taris. We evaluate AV Taris under the same conditions as AV Align and Taris on one of t
Related papers
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) 0.00
- How To Teach DNNs To Pay Attention To The Visual Modality In Speech Recognition (2020) 10.97
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024) 8.60
- Streaming Audio-visual Speech Recognition With Alignment Regularization (2022) 3.58
- Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video (2022) 11.39
- AlignVSR: Audio-visual Cross-modal Alignment For Visual Speech Recognition (2024) 0.00
- VarArray Meets t-SOT: Advancing The State Of The Art Of Streaming Distant Conversational Speech Recognition (2022) 9.03
- Target Speech Extraction With Pre-trained AV-HuBERT And Mask-and-recover Strategy (2024) 4.52