PiTL: Cross-modal Retrieval With Weakly-supervised Vision-language Pre-training Via Prompting
2023 · Zixin Guo, Tzu-Jui Julius Wang, Selen Pehlivan, et al.
Abstract
Vision-language (VL) pre-training (VLP) has been shown to generalize VL models well over a wide range of VL downstream tasks, especially cross-modal retrieval. However, it hinges on a huge number of image-text pairs, which require tedious and costly curation. In contrast, weakly-supervised VLP (W-VLP) explores using object tags generated from images by a pre-trained object detector (OD). Yet these methods still require paired information, i.e. images and object-level annotations, as supervision to train an OD. To further reduce the amount of supervision, we propose Prompts-in-The-Loop (PiTL), which prompts large language models (LLMs) for knowledge to describe images. Concretely, given a category label of an image, e.g. refinery, the knowledge extracted by LLMs, e.g. "a refinery could be seen with large storage tanks, pipework, and ...", is used as the language counterpart. The knowledge supplements, e.g., the common relations among entities most likely to appear in a scene. We create I
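The pairing step the abstract describes (a category label goes in, an LLM-written description comes out and serves as the image's text counterpart) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the prompt wording in `PROMPT_TEMPLATE`, the helpers `build_prompt` and `make_weak_pairs`, and `stub_llm`, which stands in for a real LLM call.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical prompt wording; the paper's actual prompts are not given here.
PROMPT_TEMPLATE = "Describe what can typically be seen in an image of: {label}"


def build_prompt(label: str) -> str:
    """Turn an image-level category label into an LLM prompt."""
    return PROMPT_TEMPLATE.format(label=label.strip())


def make_weak_pairs(
    images_with_labels: Iterable[Tuple[str, str]],
    describe: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each image with an LLM description of its category label.

    `describe` is any callable mapping a prompt string to generated text.
    Descriptions are cached per label so each label is prompted only once.
    """
    cache: Dict[str, str] = {}
    pairs: List[Tuple[str, str]] = []
    for image_id, label in images_with_labels:
        if label not in cache:
            cache[label] = describe(build_prompt(label))
        pairs.append((image_id, cache[label]))
    return pairs


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption for illustration)."""
    return f"[LLM output for: {prompt}]"


if __name__ == "__main__":
    data = [("img_001.jpg", "refinery"), ("img_002.jpg", "refinery")]
    for image_id, text in make_weak_pairs(data, stub_llm):
        print(image_id, "->", text)
```

Swapping `stub_llm` for a real model call would yield image-text pairs that need only image-level category labels, which is the reduced form of supervision the abstract argues for.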
Related papers
- Learning By Hallucinating: Vision-language Pre-training With Weak Supervision (2022) · 4.52
- Unsupervised Vision-and-language Pre-training Via Retrieval-based Multi-granular Alignment (2022) · 10.48
- Visual Adaptive Prompting For Compositional Zero-shot Learning (2025) · 2.26
- Leveraging Retrieval-augmented Tags For Large Vision-language Understanding In Complex Scenes (2024) · 0.00
- Retrieval-enhanced Visual Prompt Learning For Few-shot Classification (2023) · 4.52
- Weakly Supervised Vision-and-language Pre-training With Relative Representations (2023) · 3.58
- VoP: Text-video Co-operative Prompt Tuning For Cross-modal Retrieval (2022) · 16.41
- MLLMs-augmented Visual-language Representation Learning (2023) · 0.00