Unifying Vision-language Representation Space With Single-tower Transformer
2022 · Jiho Jang, Chaerin Kong, Donghyeon Jeon, et al.
Abstract
Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this paper, we explore the bold hypothesis that an image and its caption can be simply regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a generic one-tower model for vision-language pretraining (VLP), and propose OneR as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from the previous works that learn modality-specific representation spaces such as zero-shot object localization, text-guided visual reasoning and multi-modal retrieval, and present analyses to provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified modality-agno
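The idea of treating an image and its caption as two views of the same content can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss. The following is a minimal sketch under stated assumptions, not OneR's actual objective: the function names, the temperature of 0.07, and the toy embeddings are all hypothetical. In a one-tower setting both sets of embeddings would come from the same transformer; here they are plain lists so the example stays self-contained.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses.

    Pair i (image_emb[i], text_emb[i]) is the positive; every other
    pairing in the batch acts as a negative. The temperature value is
    an assumption chosen for illustration.
    """
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    # Cosine similarity between every image and every caption.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

# Toy batch: three captions as orthogonal unit vectors.
captions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Images that match their captions exactly, and images shifted by one slot.
aligned = symmetric_contrastive_loss(captions, captions)
mismatched = symmetric_contrastive_loss(captions[1:] + captions[:1], captions)
```

With matched pairs the loss is close to zero, while the shifted batch is penalized heavily, which is the pressure that pulls paired image and text representations together in a shared space.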
Related papers
- Unicoder-VL: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019) 20.24
- UFO: A Unified Transformer For Vision-language Representation Learning (2021) 0.00
- Multimodal Contrastive Training For Visual Representation Learning (2021) 16.32
- EVE: Efficient Vision-language Pre-training With Masked Prediction And Modality-aware Moe (2023) 7.50
- 12-in-1: Multi-task Vision And Language Representation Learning (2019) 17.85
- CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language (2023) 0.00
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) 0.00
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025) 1.69