VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval
2021 · Lisai Zhang, Hongfa Wu, Qingcai Chen, et al.
Abstract
Cross-modal retrieval has emerged as one of the most important upgrades for text-only search engines (SE). Recently, with powerful representations of pairwise text-image inputs via early interaction, vision-language (VL) transformers have outperformed existing methods in text-image retrieval accuracy. However, when the same paradigm is used for inference, the efficiency of VL transformers is still too low for use in a real cross-modal SE. Inspired by the mechanism of human learning and using cross-modal knowledge, this paper presents a novel Vision-Language Decomposed Transformer (VLDeformer), which greatly increases the efficiency of VL transformers while maintaining their outstanding accuracy. With the proposed method, cross-modal retrieval is separated into two stages: the VL transformer learning stage and the VL decomposition stage. The latter stage plays the role of single-modal indexing, which is to some extent like the term indexing of a text SE. The model le
Related papers
- Thinking Fast And Slow: Efficient Text-to-visual Retrieval With Transformers (2021) 15.16
- GlobalDoc: A Cross-modal Vision-language Framework For Real-world Document Image Retrieval And Classification (2023) 3.58
- Evo-retriever: LLM-guided Curriculum Evolution With Viewpoint-pathway Collaboration For Multimodal Document Retrieval (2026) 0.00
- Towards Efficient Cross-modal Visual Textual Retrieval Using Transformer-encoder Deep Features (2021) 6.34
- VITR: Augmenting Vision Transformers With Relation-focused Learning For Cross-modal Information Retrieval (2023) 4.52
- Unicoder-VL: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019) 20.24
- LightningDOT: Pre-training Visual-semantic Embeddings For Real-time Image-text Retrieval (2021) 17.42
- Exploring A Unified Vision-centric Contrastive Alternatives On Multi-modal Web Documents (2025) 1.69