Learning By Hallucinating: Vision-language Pre-training With Weak Supervision
2022 · Tzu-Jui Julius Wang, Jorma Laaksonen, Tomas Langer, et al.
Abstract
Weakly-supervised vision-language (V-L) pre-training (W-VLP) aims at learning cross-modal alignment with little or no paired data, such as aligned images and captions. Recent W-VLP methods, which pair visual features with object tags, help achieve performances comparable with some VLP models trained with aligned pairs in various V-L downstream tasks. This, however, is not the case in cross-modal retrieval (XMR). We argue that the learning of such a W-VLP model is curbed and biased by the object tags of limited semantics. We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH), which is trained via weak supervision as a W-VLP model, not requiring images paired with captions. WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities. Empirically, WFH consistently boosts the prior W-VLP works, e.g. U-VisualBERT (U-VB)
Related papers
- PiTL: Cross-modal Retrieval With Weakly-supervised Vision-language Pre-training Via Prompting (2023)
- Unsupervised Vision-and-language Pre-training Via Retrieval-based Multi-granular Alignment (2022)
- Weakly Supervised Vision-and-language Pre-training With Relative Representations (2023)
- Is Multimodal Vision Supervision Beneficial To Language? (2023)
- Alleviating Hallucination In Large Vision-language Models With Active Retrieval Augmentation (2024)
- MLLMs-augmented Visual-language Representation Learning (2023)
- Vision-language Modelling For Radiological Imaging And Reports In The Low Data Regime (2023)
- ViLBERT: Pretraining Task-agnostic Visiolinguistic Representations For Vision-and-language Tasks (2019)