Unsupervised Vision-and-language Pre-training Via Retrieval-based Multi-granular Alignment
2022 · Mingyang Zhou, Licheng Yu, Amanpreet Singh, et al.
Abstract
Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, most existing models require pre-training on a large set of parallel image-text data, which is costly to collect compared to image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn cross-modal representations from non-parallel image and text datasets. We find two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks, including region-to-tag, region-to-phrase, and image-to-sentence alignment, to bri
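The retrieval step named in the abstract can be illustrated with a small sketch. The abstract does not specify how retrieval is done, so everything here is an assumption made for illustration: each image is assumed to come with a list of detected object tags, and sentences from the text-only corpus are ranked by simple lexical overlap (Jaccard similarity) with those tags. The function names `tokenize`, `jaccard`, and `build_weakly_aligned_corpus` are hypothetical and are not taken from the paper.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_weakly_aligned_corpus(
    image_tags: Dict[str, List[str]],
    sentences: List[str],
    top_k: int = 1,
) -> Dict[str, List[str]]:
    """Pair each image with its top_k most similar sentences.

    Assumption: similarity is lexical overlap between detected tags and
    sentence tokens. This is a stand-in for whatever retrieval model the
    authors actually use.
    """
    sentence_tokens = [tokenize(s) for s in sentences]
    corpus: Dict[str, List[str]] = {}
    for image_id, tags in image_tags.items():
        tag_tokens = tokenize(" ".join(tags))
        # Rank sentence indices by similarity to this image's tags.
        ranked = sorted(
            range(len(sentences)),
            key=lambda i: jaccard(tag_tokens, sentence_tokens[i]),
            reverse=True,
        )
        corpus[image_id] = [sentences[i] for i in ranked[:top_k]]
    return corpus


if __name__ == "__main__":
    # Toy, made-up data: two images with tags and three unpaired sentences.
    tags = {"img1": ["dog", "frisbee"], "img2": ["pizza", "table"]}
    texts = [
        "A dog catches a frisbee in the park.",
        "Two people share a pizza at a table.",
        "A train arrives at the station.",
    ]
    for image_id, matched in build_weakly_aligned_corpus(tags, texts).items():
        print(image_id, "->", matched)
```

The resulting image-sentence pairs are only weakly aligned: a retrieved sentence mentions some of the same objects but was never written to describe the image, which is why the abstract pairs this step with alignment tasks at several granularities.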
Related papers
- PiTL: Cross-modal Retrieval With Weakly-supervised Vision-language Pre-training Via Prompting (2023) 7.16
- Learning By Hallucinating: Vision-language Pre-training With Weak Supervision (2022) 4.52
- Weakly Supervised Vision-and-language Pre-training With Relative Representations (2023) 3.58
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) 13.60
- MLLMs-augmented Visual-language Representation Learning (2023) 0.00
- HiVLP: Hierarchical Vision-language Pre-training For Fast Image-text Retrieval (2022) 0.00
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025) 0.00
- CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language (2023) 0.00