FLEX-CLIP: Feature-level Generation Network Enhanced CLIP For X-shot Cross-modal Retrieval
2024 · Jingyou Xie, Jiayi Kuang, Zhenzhou Lin, et al.
Abstract
Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality, where the target domain includes classes that are disjoint from the source domain. Compared with classical few-shot CMR methods, vision-language pretraining methods such as CLIP have shown strong few-shot or zero-shot learning performance. However, they still face challenges due to (1) the feature degradation encountered in the target domain and (2) extreme data imbalance. To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP. FLEX-CLIP includes two training stages. In the multimodal feature generation stage, we propose a composite multimodal VAE-GAN network that captures real feature distribution patterns and generates pseudo samples based on CLIP features, addressing data imbalance. In the common space projection stage, we develop a gate residual network that fuses CLIP features with projected features, reducing feature degradation
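The abstract does not spell out how the gate residual network combines the two feature streams, so the following is only a minimal sketch of one common gated residual fusion scheme, not the authors' implementation. It assumes a single linear projection, a sigmoid gate computed from the concatenated CLIP and projected features, and a per-dimension convex blend; the names `gated_residual_fusion`, `linear`, and `random_matrix` are hypothetical helpers introduced for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(weights, bias, vec):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gated_residual_fusion(clip_feat, proj_w, proj_b, gate_w, gate_b):
    """Blend a CLIP feature with its projection using a learned gate.

    Assumed form (not taken from the paper):
        p = W_p x + b_p
        g = sigmoid(W_g [x; p] + b_g)
        out = g * p + (1 - g) * x
    """
    projected = linear(proj_w, proj_b, clip_feat)
    # `clip_feat + projected` is list concatenation, giving [x; p].
    gate = [sigmoid(z) for z in linear(gate_w, gate_b, clip_feat + projected)]
    fused = [g * p + (1.0 - g) * c
             for g, p, c in zip(gate, projected, clip_feat)]
    return fused, projected, gate


def random_matrix(rows, cols, rng, scale=0.1):
    """Toy stand-in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
dim = 4
clip_feat = [0.5, -1.0, 0.25, 2.0]       # toy stand-in for a CLIP embedding
proj_w = random_matrix(dim, dim, rng)
proj_b = [0.0] * dim
gate_w = random_matrix(dim, 2 * dim, rng)
gate_b = [0.0] * dim

fused, projected, gate = gated_residual_fusion(
    clip_feat, proj_w, proj_b, gate_w, gate_b)

# With a strongly negative gate bias the gate closes and the output
# falls back to the original CLIP feature (the residual path).
closed, _, _ = gated_residual_fusion(
    clip_feat, proj_w, proj_b, gate_w, [-50.0] * dim)
```

Under these assumptions, each fused dimension lies between the CLIP value and the projected value, so a closed gate preserves the pretrained feature while an open gate favours the projection. That is one plausible way such a design could limit feature degradation.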
Related papers
- CLIP-MoE: Towards Building Mixture Of Experts For CLIP With Diversified Multiplet Upcycling (2024)
- Distill CLIP (DCLIP): Enhancing Image-text Retrieval Via Cross-modal Transformer Distillation (2025)
- Cross-modal Retrieval Meets Inference: Improving Zero-shot Classification With Cross-modal Retrieval (2023)
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- A Comprehensive Empirical Study Of Vision-language Pre-trained Model For Supervised Cross-modal Retrieval (2022)
- Multi-task Cross-modal Learning For Chest X-ray Image Retrieval (2026)
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)