DreamLIP: Language-image Pre-training With Long Captions
2024 · Kecheng Zheng, Yifei Zhang, Wei Wu, et al.
Abstract
Language-image pre-training largely relies on how precisely and thoroughly a text describes its paired image. In practice, however, the contents of an image can be so rich that describing them well requires lengthy captions (e.g., with 10 sentences), which are usually missing in existing datasets. Consequently, there is currently no clear evidence on whether and how language-image pre-training could benefit from long captions. To figure this out, we first re-caption 30M images with detailed descriptions using a pre-trained Multi-modality Large Language Model (MLLM), and then study the usage of the resulting captions under a contrastive learning framework. We observe that each sentence within a long caption is very likely to describe the image partially (e.g., an object). Motivated by this, we propose to dynamically sample sub-captions from the text label to construct multiple positive pairs, and introduce a grouping loss to match the embeddings of each sub-caption with its correspon
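The sub-caption sampling idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the naive sentence splitter, the uniform sampling, the InfoNCE-style loss form, the temperature of 0.07, and all function names (`split_sentences`, `sample_sub_captions`, `multi_positive_loss`) are hypothetical. The grouping loss is not covered, since the abstract is cut off before describing it.

```python
import math
import random
import re


def split_sentences(caption):
    """Split a long caption into sentences (naive punctuation-based split; an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def sample_sub_captions(caption, k, rng):
    """Draw up to k distinct sentences uniformly at random as sub-captions."""
    sentences = split_sentences(caption)
    return rng.sample(sentences, min(k, len(sentences)))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def multi_positive_loss(image_embs, text_embs, text_owner, temperature=0.07):
    """Image-to-text contrastive loss with several positive texts per image.

    text_owner[j] is the index of the image that sub-caption j was sampled from.
    Assumes every image owns at least one sub-caption.
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, t) / temperature for t in text_embs]
        m = max(logits)
        # Numerically stable log-sum-exp over all sub-captions in the batch.
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        positives = [l for l, owner in zip(logits, text_owner) if owner == i]
        # Average the negative log-likelihood over this image's positives.
        total += -sum(l - log_denom for l in positives) / len(positives)
    return total / len(image_embs)
```

In this sketch, each training step would call `sample_sub_captions` again with a fresh draw, so the set of positive pairs for an image changes from step to step. The embeddings passed to `multi_positive_loss` stand in for the outputs of an image encoder and a text encoder, which are not modeled here.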
Related papers
- LoTLIP: Improving Language-image Pre-training For Long Text Understanding (2024) 2.26
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024) 0.00
- MLLMs-augmented Visual-language Representation Learning (2023) 0.00
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) 2.26
- CLIP Is Shortsighted: Paying Attention Beyond The First Sentence (2026) 0.00
- CLIPS: An Enhanced CLIP Framework For Learning With Synthetic Captions (2024) 0.00
- FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs (2025) 6.34
- ImageBERT: Cross-modal Pre-training With Large-scale Weak-supervised Image-text Data (2020) 0.00