UniFGVC: Universal Training-free Few-shot Fine-grained Vision Classification Via Attribute-aware Multimodal Retrieval
2025 · Hongyu Guo, Xiangzhao Hao, Jiarui Guo, et al.
Abstract
Few-shot fine-grained visual classification (FGVC) aims to enable models to discriminate subtly distinct categories from limited data. Recent works mostly fine-tune pre-trained vision-language models to improve performance, yet they suffer from overfitting and weak generalization. To address this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner), which exploits the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description capturing the fine-grained attributes that distinguish closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and make the generated captions more discriminative. With it, each image can be converted into an image-description pair, enabling a more comprehensive feature representation, an
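The abstract above is cut off before it describes the retrieval stage, so the following is only a minimal sketch of what "few-shot classification as multimodal retrieval" could look like, not the authors' pipeline. It assumes each image already has an image embedding and an embedding of its generated description (both are toy vectors here), and that the two are fused by normalizing and concatenating them. The fusion rule and the names `fuse` and `classify_by_retrieval` are hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_emb, text_emb):
    """Assumed fusion rule: normalize each modality, concatenate, renormalize."""
    return l2_normalize(l2_normalize(image_emb) + l2_normalize(text_emb))


def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify_by_retrieval(query, gallery):
    """Return the label and score of the most similar support pair.

    query:   (image_emb, text_emb) for the test image
    gallery: list of (label, image_emb, text_emb) for the few-shot support set
    """
    q = fuse(*query)
    best_label, best_score = None, -math.inf
    for label, image_emb, text_emb in gallery:
        score = cosine(q, fuse(image_emb, text_emb))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy support set: one image-description pair per class (placeholder vectors).
gallery = [
    ("sparrow", [1.0, 0.0], [0.9, 0.1]),
    ("finch", [0.0, 1.0], [0.1, 0.9]),
]
query = ([0.8, 0.2], [1.0, 0.0])
label, score = classify_by_retrieval(query, gallery)
print(label)  # sparrow
```

No parameters are trained in this sketch: adding a class only means adding its image-description pairs to `gallery`, which is the sense in which a retrieval formulation is training-free.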
Related papers
- One-shot Fine-grained Instance Retrieval (2017) · 10.35
- Language-driven Fine-grained Retrieval (2025) · 0.00
- Fine-grained Image Retrieval Via Dual-vision Adaptation (2025) · 0.00
- DVF: Advancing Robust And Accurate Fine-grained Image Retrieval With Retrieval Guidelines (2024) · 9.03
- Unicvr: From Alignment To Reranking For Unified Zero-shot Composed Visual Retrieval (2026) · 0.00
- Finevit: Progressively Unlocking Fine-grained Perception With Dense Recaptions (2026) · 0.00
- Globaldoc: A Cross-modal Vision-language Framework For Real-world Document Image Retrieval And Classification (2023) · 3.58
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025) · 5.75