Robustness Analysis of Video-Language Models Against Visual and Language Perturbations
2022 · Madeline C. Schiappa, Shruti Vyas, Hamid Palangi, et al.
Abstract
Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single modal learning. However, robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of video-language models against various real-world perturbations. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations. The study reveals some interesting initial findings from the studied models: 1) models are generally more susceptible when only video is perturbed as opposed to when only text is perturbed, 2) models that are pre-trained are more robust than those trained from scratch, 3) models attend more to scene and objects rather than motion and action. We hope this study will serve as a benchmark and guide future research in robust video-language lea