Matching Images And Text With Multi-modal Tensor Fusion And Re-ranking
2019 · Tan Wang, Xing Xu, Yang Yang, et al.
Abstract
A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or classification, the first one mapping image and text instances into a common embedding space for distance measuring, and the second one regarding image-text matching as a binary classification problem. Neither of these approaches can, however, balance the matching accuracy and model complexity well. We propose a novel framework that achieves remarkable matching performance with acceptable model complexity. Specifically, in the training stage, we propose a novel Multi-modal Tensor Fusion Network (MTFN) to explicitly learn an accurate image-text similarity function with rank-based tensor fusion rather than seeking a common embedding space for each image-text instance. Then, during testing, we deploy a generic Cross-modal Re-ranking (RR) scheme for refinement without requiring additional training p
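The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no architectural details, so the rank-R bilinear score and the reverse-search re-ranking rule below are assumptions about what "rank-based tensor fusion" and "Cross-modal Re-ranking" might look like. The names `low_rank_similarity`, `rerank_texts`, `U`, `W`, and `sim` are hypothetical.

```python
def low_rank_similarity(v, t, U, W):
    """Assumed fusion: a rank-R bilinear score sum_r (U[r].v) * (W[r].t).

    v and t are image and text feature vectors; U and W each hold R
    projection vectors. No common embedding space is needed, only a score.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(dot(u_r, v) * dot(w_r, t) for u_r, w_r in zip(U, W))


def rerank_texts(sim, image_idx, k=3):
    """Assumed re-ranking: re-order the top-k texts for one query image by
    how highly each text ranks that image in the reverse direction.

    sim[i][j] is the similarity between image i and text j. No training
    is involved; only the precomputed similarity matrix is used.
    """
    n_images = len(sim)
    n_texts = len(sim[0])

    # Forward search: the k texts most similar to the query image.
    top_k = sorted(range(n_texts), key=lambda j: -sim[image_idx][j])[:k]

    def reverse_rank(j):
        # Position of the query image when text j searches over all images.
        order = sorted(range(n_images), key=lambda i: -sim[i][j])
        return order.index(image_idx)

    # sorted() is stable, so ties keep their forward-search order.
    return sorted(top_k, key=reverse_rank)


# Toy similarity matrix: 3 images (rows) by 3 texts (columns).
sim = [
    [0.90, 0.80, 0.10],
    [0.95, 0.20, 0.30],
    [0.10, 0.30, 0.70],
]
print(rerank_texts(sim, image_idx=0, k=2))  # [1, 0]
```

In the toy matrix, text 0 is the top forward match for image 0, but text 0 itself prefers image 1, whereas text 1 ranks image 0 first. The reverse search therefore promotes text 1, which is the kind of test-time refinement a re-ranking step can provide without retraining.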
Related papers
- Transformer Reasoning Network For Image-text Matching And Retrieval (2020) 16.15
- Enhancing Image-text Matching With Adaptive Feature Aggregation (2024) 6.34
- Ranking-based Fusion Algorithms For Extreme Multi-label Text Classification (XMTC) (2025) 0.00
- Joint Fusion And Encoding: Advancing Multimodal Retrieval From The Ground Up (2025) 0.00
- When Vision Meets Texts In Listwise Reranking (2026) 0.00
- Transcending Fusion: A Multi-scale Alignment Method For Remote Sensing Image-text Retrieval (2024) 11.92
- Exploiting "quantum-like Interference" In Decision Fusion For Ranking Multimodal Documents (2018) 0.00
- Retrieve Fast, Rerank Smart: Cooperative And Joint Approaches For Improved Cross-modal Retrieval (2021) 10.97