Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations
2022 Β· Dan Oneata, Horia Cucu
Abstract
Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated to the audio input. While previous approaches make crucial use of strong visual representations, e.g. by finetuning pretrained image recognition networks, significantly less attention has been paid to its counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following similar techniques to the ones used for the visual encoder, namely, transferring representations and data augmentation. First, we show that starting from a pretrained ASR significantly improves the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains by including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique repla
Authors
(none)
Tags
Stats
Related papers
- Multimodal Grounding For Sequence-to-sequence Speech Recognition (2018)8.82
- You Do Not Need More Data: Improving End-to-end Speech Recognition By Text-to-speech Data Augmentation (2020)11.49
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023)6.34
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022)11.29
- Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions (2019)0.00
- Speaker Verification-derived Loss And Data Augmentation For Dnn-based Multispeaker Speech Synthesis (2021)3.58
- Speech Recognition With Augmented Synthesized Speech (2019)13.97
- A Study On Joint Modeling And Data Augmentation Of Multi-modalities For Audio-visual Scene Classification (2022)5.24