Multi-level CLS Token Fusion For Contrastive Learning In Endoscopy Image Classification
2025 · Y Hop Nguyen, Doan Anh Phan Huu, Trung Thai Tran, et al.
Abstract
We present a unified vision-language framework tailored for ENT endoscopy image analysis that simultaneously tackles three clinically relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. Unlike conventional CNN-based pipelines, which struggle to capture cross-modal semantics, our approach builds on the CLIP ViT-B/16 backbone and enhances it with Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation. Together, these components enable efficient fine-tuning on limited medical data while improving representation diversity and semantic alignment across modalities. To bridge the gap between visual inputs and textual diagnostic context, we introduce class-specific natural language prompts that guide the image encoder through a joint training objective combining supervised classification with contrastive learning. We validated our framework through participation in the ACM MM'25 ENTRep Grand Challenge, achieving
Related papers
- Multi-task Cross-modal Learning For Chest X-ray Image Retrieval (2026) · 0.00
- MedCLIP: Contrastive Learning From Unpaired Medical Images And Text (2022) · 26.02
- Exploring The Capabilities Of LLM Encoders For Image-text Retrieval In Chest X-rays (2025) · 0.00
- Efficient Medical Vision-language Alignment Through Adapting Masked Vision Models (2025) · 5.74
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024) · 0.00
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) · 0.00
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025) · 1.69
- Uclip: Parameter-efficient Multilingual Extension Of Vision-language Models With Unpaired Data (2025) · 0.00