Learning Image-text Matching With Optimal Partial Transport
2026 · Zhengxin Pan, Haishuai Wang, Fangyu Wu, et al.
Abstract
Cross-modal matching, a fundamental task in bridging vision and language, has recently garnered substantial research interest. Despite the development of numerous methods for quantifying the semantic relatedness between image-text pairs, these methods often fall short of achieving both outstanding performance and high efficiency. In this paper, we propose the crOss-Modal sInkhorn maTching (OMIT) network as a solution that improves performance while maintaining efficiency. Rooted in the theoretical foundations of Optimal Transport, OMIT uses the Cross-modal Mover's Distance to compute the similarity between fine-grained visual and textual fragments, with Sinkhorn iterations for efficient approximation. To further alleviate the issue of redundant alignments, we integrate partial matching into OMIT, leveraging local-to-global similarities to eliminate the interference of irrelevant fragments. We conduct extensive ev
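To make the Sinkhorn step in the abstract concrete, here is a minimal sketch of entropy-regularized optimal transport between two sets of fragment embeddings. It is not the authors' OMIT implementation: the function names, the cost `1 - cosine`, the uniform marginals, and the values of `eps` and `n_iters` are all assumptions chosen for illustration, and the partial-matching step is not shown.

```python
import math


def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return dot / (nx * ny)


def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Approximate the entropic optimal transport plan with Sinkhorn iterations.

    cost: n x m cost matrix; a, b: marginals of length n and m (each sums to 1).
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel of the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_similarity(regions, words, eps=0.1, n_iters=200):
    """Score an image-text pair as the plan-weighted sum of fragment similarities.

    Assumption: uniform mass on every fragment and cost = 1 - cosine similarity.
    """
    sim = [[cosine(r, w) for w in words] for r in regions]
    cost = [[1.0 - s for s in row] for row in sim]
    a = [1.0 / len(regions)] * len(regions)
    b = [1.0 / len(words)] * len(words)
    plan = sinkhorn_plan(cost, a, b, eps, n_iters)
    score = sum(
        plan[i][j] * sim[i][j]
        for i in range(len(regions))
        for j in range(len(words))
    )
    return score, plan
```

Calling `transport_similarity(regions, words)` with lists of region and word vectors returns a scalar score and the transport plan. Because this balanced version forces every fragment to send or receive its full share of mass, irrelevant fragments still contribute to the score, which is the redundancy that a partial-matching variant is meant to address.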
Related papers
- A New Fine-grained Alignment Method For Image-text Matching (2023) · 0.00
- ViCLIP-OT: The First Foundation Vision-language Model For Vietnamese Image-text Retrieval With Optimal Transport (2026) · 0.00
- Learning To Rematch Mismatched Pairs For Robust Cross-modal Retrieval (2024) · 13.82
- Maximal Matching Matters: Preventing Representation Collapse For Robust Cross-modal Retrieval (2025) · 2.26
- Intra-modal Constraint Loss For Image-text Retrieval (2022) · 8.33
- ALADIN: Distilling Fine-grained Alignment Scores For Efficient Image-text Matching And Retrieval (2022) · 14.00
- Deep Multimodal Image-text Embeddings For Automatic Cross-media Retrieval (2020) · 0.00
- Dual-view Curricular Optimal Transport For Cross-lingual Cross-modal Retrieval (2023) · 9.03