Is Multimodal Vision Supervision Beneficial To Language?
2023 · Avinash Madasu, Vasudev Lal
Abstract
Vision (image and video)-language (VL) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from complementary modality supervision. In this paper, we explore whether language representations trained using vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models, such as ALBEF, BLIP, and METER, and video-text models, such as ALPRO, Frozen-in-Time (FiT), and VIOLET. We compare the language representations of the stand-alone text encoders of these models with the language representations of text encoders learned through vision supervision. Our experiments suggest that vanilla language representations show superior performance on most of the tasks. These results she
Related papers
- 12-in-1: Multi-task Vision And Language Representation Learning (2019)
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022)
- MLLMs-augmented Visual-language Representation Learning (2023)
- ViLBERT: Pretraining Task-agnostic Visiolinguistic Representations For Vision-and-language Tasks (2019)
- Language Features Matter: Effective Language Representations For Vision-language Tasks (2019)
- Goldiclip: The Goldilocks Approach For Balancing Explicit Supervision For Language-image Pretraining (2026)
- VLMo: Unified Vision-language Pre-training With Mixture-of-modality-experts (2021)
- Learning By Hallucinating: Vision-language Pre-training With Weak Supervision (2022)