Aligning Vision Models With Human Aesthetics In Retrieval: Benchmarks And Algorithms
2024 Β· Miaosen Zhang, Yixuan Wei, Zhen Xing, et al.
Abstract
Modern vision models are trained on very large noisy datasets. While these models acquire strong capabilities, they may not follow the user's intent to output the desired results in certain aspects, e.g., visual aesthetic, preferred style, and responsibility. In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, which are limited to low-level features like saturation and perform poorly when stylistic, cultural or knowledge contexts are involved. We find that utilizing the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming. Based on the above findings, we propose a preference-based reinforcement learning method that fine-tunes the vision models to distill the knowledge from both LLMs reasoning and the aesthet
Authors
(none)
Tags
Stats
Related papers
- Mrag-bench: Vision-centric Evaluation For Retrieval-augmented Multimodal Models (2024)0.00
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models And Vision Language Models (2024)8.82
- A Little More Like This: Text-to-image Retrieval With Vision-language Models Using Relevance Feedback (2025)0.00
- Toward Automatic Relevance Judgment Using Vision--language Models For Image--text Retrieval Evaluation (2024)0.00
- Vision-deepresearch Benchmark: Rethinking Visual And Textual Search For Multimodal Large Language Models (2026)7.27
- Benchmarking Deflection And Hallucination In Large Vision-language Models (2026)0.00
- Realign: Optimizing The Visual Document Retriever With Reasoning-guided Fine-grained Alignment (2026)2.20
- Indexing Multimodal Language Models For Large-scale Image Retrieval (2026)0.00