X-Aligner: Composed Visual Retrieval Without The Bells And Whistles
2026 · Yuqian Zheng, Mariana-Iuliana Georgescu
Abstract
Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over the initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel module, X-Aligner, composed of cross-attention layers that progressively fuse the visual and textual inputs and align the resulting multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of BLIP-family
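The progressive fusion and two-stage schedule described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the residual connection, the choice of textual tokens (modification text plus caption) as queries over visual tokens, the mean pooling, and all names (`CrossAttentionLayer`, `fuse_query`, `rank_targets`, `trainable_modules`) are assumptions made for illustration, and random vectors stand in for VLM features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


class CrossAttentionLayer:
    """Single-head cross-attention with a residual connection (assumed design)."""

    def __init__(self, dim, rng):
        std = dim ** -0.5
        self.dim = dim
        self.w_q = rng.normal(0.0, std, (dim, dim))
        self.w_k = rng.normal(0.0, std, (dim, dim))
        self.w_v = rng.normal(0.0, std, (dim, dim))

    def __call__(self, queries, context):
        q = queries @ self.w_q                            # (n_query, dim)
        k = context @ self.w_k                            # (n_context, dim)
        v = context @ self.w_v                            # (n_context, dim)
        weights = softmax(q @ k.T / np.sqrt(self.dim))    # (n_query, n_context)
        return queries + weights @ v                      # residual update


def fuse_query(visual_tokens, text_tokens, caption_tokens, layers):
    # Assumption: text and caption tokens act as queries over the visual tokens.
    fused = np.concatenate([text_tokens, caption_tokens], axis=0)
    for layer in layers:                                  # progressive fusion
        fused = layer(fused, visual_tokens)
    return l2_normalize(fused.mean(axis=0))               # pooled query embedding


def rank_targets(query_embedding, target_embeddings):
    # Cosine similarity against candidate target videos, best match first.
    scores = l2_normalize(target_embeddings) @ query_embedding
    return np.argsort(-scores)


def trainable_modules(stage):
    # Stage 1 trains only the new module; stage 2 also fine-tunes the text encoder.
    if stage == 1:
        return {"x_aligner"}
    if stage == 2:
        return {"x_aligner", "text_encoder"}
    raise ValueError("stage must be 1 or 2")


rng = np.random.default_rng(0)
dim = 16
layers = [CrossAttentionLayer(dim, rng) for _ in range(3)]
query = fuse_query(
    rng.normal(size=(8, dim)),   # stand-in visual tokens of the query video
    rng.normal(size=(5, dim)),   # stand-in modification-text tokens
    rng.normal(size=(6, dim)),   # stand-in caption tokens
    layers,
)
ranking = rank_targets(query, rng.normal(size=(4, dim)))
```

Stacking several layers is what makes the fusion progressive in this sketch: each layer refines the textual tokens with another pass over the visual tokens before pooling.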
Related papers
- PREGEN: Uncovering Latent Thoughts In Composed Video Retrieval (2026) 0.00
- Composed Video Retrieval Via Enriched Context And Discriminative Embeddings (2024) 12.19
- Unicvr: From Alignment To Reranking For Unified Zero-shot Composed Visual Retrieval (2026) 0.00
- Covr-r: Reason-aware Composed Video Retrieval (2026) 2.02
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00
- Realign: Optimizing The Visual Document Retriever With Reasoning-guided Fine-grained Alignment (2026) 2.20
- Discovla: Discrepancy Reduction In Vision, Language, And Alignment For Parameter-efficient Video-text Retrieval (2025) 6.30
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion (2025) 2.26