Evaluation Of Audio-visual Alignments In Visually Grounded Speech Models
2021 · Khazar Khorrami, Okko Räsänen
Abstract
Systems that can find correspondences between multiple modalities, such as between speech and images, have great potential to solve different recognition and data analysis tasks in an unsupervised manner. This work studies multimodal learning in the context of visually grounded speech (VGS) models, and focuses on their recently demonstrated capability to extract spatiotemporal alignments between spoken words and the corresponding visual objects without ever being explicitly trained for object localization or word recognition. As our main contributions, we formalize the alignment problem in terms of an audiovisual alignment tensor that is based on earlier VGS work, introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words, and propose a new VGS model variant for the alignment task that uses a cross-modal attention layer. We test our model and a previously proposed model in the alignment task using SPEECH-COCO captions coupled with MSCOCO imag
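The abstract does not define the audiovisual alignment tensor, so the following is only a minimal sketch of one plausible reading: a dot-product similarity between every spatial image embedding and every audio frame embedding. The function names, shapes, and toy values are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch: the dot-product formulation, names, and shapes below
# are assumptions for illustration, not the paper's actual definition.

def alignment_tensor(image_feats, audio_feats):
    """Build an H x W x T nested list of similarities.

    image_feats: H x W grid of D-dimensional vectors (one per image location).
    audio_feats: T frames of D-dimensional vectors (one per audio time step).
    """
    return [
        [
            [sum(v * a for v, a in zip(pixel, frame)) for frame in audio_feats]
            for pixel in row
        ]
        for row in image_feats
    ]


def best_location(tensor, t):
    """Return the (row, col) image location scoring highest for audio frame t."""
    scores = [
        (tensor[i][j][t], (i, j))
        for i in range(len(tensor))
        for j in range(len(tensor[0]))
    ]
    return max(scores)[1]


# Toy input: a 2x2 image grid, 3 audio frames, 2-dimensional embeddings.
image_feats = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]
audio_feats = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

tensor = alignment_tensor(image_feats, audio_feats)
```

Under this assumed formulation, slicing the tensor at a fixed time step gives a spatial map for that stretch of speech, and slicing at a fixed location gives a temporal profile for that image region.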
Related papers
- AlignVSR: Audio-visual Cross-modal Alignment For Visual Speech Recognition (2024)
- Cross-modal Global Interaction And Local Alignment For Audio-visual Speech Recognition (2023)
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023)
- Fine-grained Grounding For Multimodal Speech Recognition (2020)
- Multimodal Grounding For Sequence-to-sequence Speech Recognition (2018)
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018)
- Hindi As A Second Language: Improving Visually Grounded Speech With Semantically Similar Samples (2023)
- How To Teach DNNs To Pay Attention To The Visual Modality In Speech Recognition (2020)