LoTLIP: Improving Language-image Pre-training For Long Text Understanding
2024 · Wei Wu, Kecheng Zheng, Shuailei Ma, et al.
Abstract
Understanding long text is in great demand in practice but beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key cause is that training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. To address this problem, our initial attempt is to relabel the data with long captions; however, learning directly from them may degrade performance on short-text understanding (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we manage to help the model catch up to its original level of short-text understanding while greatly enhancing its capability for long-text understanding. We further look into whether the model can continuously benefit from longer captions and notice a clear trade-off between performance and efficiency. Finally, we validate the effectiveness of our
Related papers
- DreamLIP: Language-image Pre-training With Long Captions (2024) · 10.61
- FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs (2025) · 6.34
- TULIP: Token-length Upgraded CLIP (2024) · 3.04
- Long-CLIP: Unlocking The Long-text Capability Of CLIP (2024) · 14.90
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) · 2.26
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024) · 0.00
- LexLIP: Lexicon-bottlenecked Language-image Pre-training For Large-scale Image-text Retrieval (2023) · 10.85
- MLLMs-augmented Visual-language Representation Learning (2023) · 0.00