DetailCLIP: Injecting Image Details Into CLIP's Feature Space
2022 · Zilun Zhang, Cuifeng Shen, Yuan Shen, et al.
Abstract
Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, the limited image input size of CLIP-like models (e.g., 224) means that subtle details are lost in the feature representation when the input is a high-resolution image (e.g., 2240). Our proposed framework addresses this issue by generating a single feature representation for a high-resolution image that retains image details from different scales while sharing the same semantic space as the original CLIP. One application scenario is remote sensing text-image retrieval, where targets (e.g., vehicles and ships) often appear at tiny scales. To achieve this, we develop a feature fusion model that relies on CLIP features extracted with a carefully designed image patching method, dubbed Complete Cover. This method ensures comprehensive coverage of objects across various scales and is weakly supervised by image-agnostic, class-prompted queries. We evaluate our framework's performance using real-wor
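To make the multi-scale patching idea concrete, the following is a minimal sketch of generating sliding-window patch boxes at several sizes so that every pixel is covered at every scale. It is not the paper's Complete Cover algorithm, whose details are not given in the abstract. The helper names `axis_starts` and `multiscale_patch_boxes`, the `overlap` parameter, and the chosen patch sizes are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def axis_starts(length: int, patch: int, stride: int) -> List[int]:
    """Start offsets of windows of size `patch` that together cover [0, length)."""
    if patch >= length:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    # Add a final window flush with the far edge if the stride left a gap.
    if starts[-1] + patch < length:
        starts.append(length - patch)
    return starts


def multiscale_patch_boxes(width: int, height: int,
                           patch_sizes: List[int],
                           overlap: float = 0.5) -> List[Box]:
    """Hypothetical helper: sliding-window boxes at several patch sizes.

    Illustrative only; this is an assumed tiling scheme, not the
    Complete Cover method from the paper.
    """
    boxes: List[Box] = []
    for patch in patch_sizes:
        # Stride shrinks as overlap grows; never let it drop below 1 pixel.
        stride = max(1, int(patch * (1.0 - overlap)))
        for top in axis_starts(height, patch, stride):
            for left in axis_starts(width, patch, stride):
                boxes.append((left, top,
                              min(left + patch, width),
                              min(top + patch, height)))
    return boxes


# Example: a 2240 x 2240 image tiled at three scales without overlap.
boxes = multiscale_patch_boxes(2240, 2240, [2240, 1120, 560], overlap=0.0)
```

In a pipeline of the kind the abstract describes, each box would presumably be cropped, resized to the encoder's input size (e.g., 224), and encoded, with the per-patch features then passed to a fusion model. The `(left, top, right, bottom)` ordering matches the box convention used by common image libraries such as Pillow.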
Related papers
- FLAIR: VLM With Fine-grained Language-informed Image Representations (2024)
- Finetuning CLIP To Reason About Pairwise Differences (2024)
- FLEX-CLIP: Feature-level Generation Network Enhanced CLIP For X-shot Cross-modal Retrieval (2024)
- FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs (2025)
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025)
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- DetailFusion: A Dual-branch Framework With Detail Enhancement For Composed Image Retrieval (2025)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)