FineViT: Progressively Unlocking Fine-grained Perception With Dense Recaptions
2026 · Peisen Zhao, Xiaopeng Zhang, Mingxing Xu, et al.
Abstract
While Multimodal Large Language Models (MLLMs) have advanced rapidly, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks because visual detail is lost to low-resolution pretraining and to the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm. First, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset, which comprises over 450 million high-quality local captions. Extensive exp
Related papers
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025)
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026)
- FineLIP: Extending CLIP's Reach Via Fine-grained Alignment With Longer Text Inputs (2025)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- FLAIR: VLM With Fine-grained Language-informed Image Representations (2024)
- FG-CLIP: Fine-grained Visual And Textual Alignment (2025)
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026)
- Fine-tuned CLIP Models Are Efficient Video Learners (2022)