Retrieving Counterfactuals Improves Visual In-context Learning
2026 · Guangzhi Xiong, Sanchit Sinha, Zhenghao He, et al.
Abstract
Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations betwee
Related papers
- Cir-cot: Towards Interpretable Composed Image Retrieval Via End-to-end Chain-of-thought Reasoning (2025) 0.00
- V-retrver: Evidence-driven Agentic Reasoning For Universal Multimodal Retrieval (2026) 0.00
- Imageref-vl: Enabling Contextual Image Referencing In Vision-language Models (2025) 1.91
- Recall: Recalibrating Capability Degradation For Mllm-based Composed Image Retrieval (2026) 2.90
- Leveraging Large Vision-language Model As User Intent-aware Encoder For Composed Image Retrieval (2024) 3.58
- Mcot-mvs: Multi-level Vision Selection By Multi-modal Chain-of-thought Reasoning For Composed Image Retrieval (2026) 0.00
- Love Me, Love My Label: Rethinking The Role Of Labels In Prompt Retrieval For Visual In-context Learning (2026) 1.57
- What Makes Good Examples For Visual In-context Learning? (2023) 3.58