TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm
2024 · Bingqing Zhang, Zhuo Cao, Heming Du, et al.
Abstract
Text-Video Retrieval (TVR) methods typically match query-candidate pairs by aligning text and video features in coarse-grained, fine-grained, or combined (coarse-to-fine) manners. However, these frameworks predominantly employ a one(query)-to-one(candidate) alignment paradigm, which struggles to discern nuanced differences among candidates, leading to frequent mismatches. Inspired by Comparative Judgement in human cognitive science, where decisions are made by directly comparing items rather than evaluating them independently, we propose TokenBinder. This innovative two-stage TVR framework introduces a novel one-to-many coarse-to-fine alignment paradigm, imitating the human cognitive process of identifying specific items within a large collection. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism, dynamically aligning and comparing features across multiple videos to capture finer nuances and contextual variations. Extensive experiments on s
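The two-stage, one-to-many idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' Focused-view Fusion Network: the function names (`coarse_rank`, `one_to_many_rerank`), the token shapes, and the use of a single unlearned attention step are all assumptions made for illustration. Stage 1 scores each candidate independently (one-to-one); stage 2 lets the query tokens attend over the tokens of all shortlisted candidates at once, so the candidates compete for the same attention mass.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def coarse_rank(query, videos, k):
    """Stage 1 (one-to-one): cosine similarity of one query vector
    against each video vector, returning the top-k indices and scores.

    query: (d,), videos: (num_videos, d)
    """
    sims = l2_normalize(videos) @ l2_normalize(query)
    top = np.argsort(-sims)[:k]
    return top, sims[top]


def one_to_many_rerank(query_tokens, cand_tokens):
    """Stage 2 (one-to-many, hypothetical): query tokens attend jointly
    over the tokens of all k candidates. The softmax is taken across
    every candidate's tokens, so scores are relative to the other
    candidates rather than computed independently.

    query_tokens: (m, d), cand_tokens: (k, n, d)
    Returns one score per candidate; the scores sum to 1.
    """
    k, n, d = cand_tokens.shape
    flat = cand_tokens.reshape(k * n, d)                    # (k*n, d)
    attn = softmax(query_tokens @ flat.T / np.sqrt(d), -1)  # (m, k*n)
    per_candidate = attn.reshape(-1, k, n).sum(axis=2)      # (m, k)
    return per_candidate.mean(axis=0)                       # (k,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    videos = rng.normal(size=(6, 8))           # 6 videos, 8-dim features
    query = videos[2] + 0.01 * rng.normal(size=8)

    top, sims = coarse_rank(query, videos, k=3)
    q_tok = rng.normal(size=(4, 8))            # 4 query tokens
    c_tok = rng.normal(size=(3, 5, 8))         # 5 tokens per shortlisted video
    fine = one_to_many_rerank(q_tok, c_tok)
    print("shortlist:", top, "re-ranked order:", top[np.argsort(-fine)])
```

In a trained system the token features and attention weights would be learned; here they are random, so only the data flow (shortlist, then joint comparison) is meaningful.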
Related papers
- T2vparser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) 0.95
- Continual Text-to-video Retrieval With Frame Fusion And Task-aware Routing (2025) 8.75
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- Beat: Bi-directional One-to-many Embedding Alignment For Text-based Person Retrieval (2024) 10.85
- Prota: Probabilistic Token Aggregation For Text-video Retrieval (2024) 4.52
- Tagging Before Alignment: Integrating Multi-modal Tags For Video-text Retrieval (2023) 10.74
- X-aligner: Composed Visual Retrieval Without The Bells And Whistles (2026) 0.00
- Video-colbert: Contextualized Late Interaction For Text-to-video Retrieval (2025) 5.24