Avformer: Injecting Vision Into Frozen Speech Models For Zero-shot AV-ASR
2023 Β· Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Abstract
Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation. We do this by (i) injecting visual embeddings into a frozen ASR model using lightweight trainable adaptors. We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters. (ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state of the art zero-shot results on three dif
Authors
(none)
Tags
Stats
Related papers
- Transformer-based Video Front-ends For Audio-visual Speech Recognition For Single And Multi-person Video (2022)11.39
- XLAVS-R: Cross-lingual Audio-visual Speech Representation Learning For Noise-robust Speech Perception (2024)7.50
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)0.00
- Multilingual Audio-visual Speech Recognition With Hybrid CTC/RNN-T Fast Conformer (2024)8.60
- Audio-visual Speech Separation In Noisy Environments With A Lightweight Iterative Model (2023)0.00
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)0.00
- Av-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations (2023)0.00
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022)6.77