Revolutionizing Text-to-image Retrieval As Autoregressive Token-to-voken Generation
2024 · Yongqi Li, Hongru Cai, Wenjie Wang, et al.
Abstract
Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image either via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line, which assigns each image a unique string identifier and generates the identifier of the target image as the retrieval result. Despite their great potential, existing generative approaches are limited by three issues: insufficient visual information in identifiers, misalignment with high-level semantics, and a learning gap towards the retrieval target. To address these issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and innovatively formulates the text-to-image retrieval task as a
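The generative retrieval idea described in the abstract (each image is represented by a token sequence, and retrieval means generating that sequence) can be illustrated with a small sketch. The code below is not the AVG implementation. It assumes a common technique from generative retrieval in general, prefix-tree constrained decoding, so that every generated sequence corresponds to a real image. The voken ids, the image names, and the helper `make_toy_scorer` (a stand-in for a trained autoregressive model) are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy index: every image is a fixed-length sequence of voken ids.
IMAGE_VOKENS: Dict[str, Tuple[int, ...]] = {
    "img_cat": (3, 1, 4),
    "img_dog": (3, 1, 5),
    "img_car": (2, 7, 1),
}

LEAF = "<id>"  # marker key that stores the image id at the end of a sequence


def build_trie(index: Dict[str, Sequence[int]]) -> dict:
    """Build a prefix tree over all voken sequences (assumed equal length)."""
    root: dict = {}
    for image_id, vokens in index.items():
        node = root
        for v in vokens:
            node = node.setdefault(v, {})
        node[LEAF] = image_id
    return root


def constrained_greedy_decode(
    trie: dict, score_next: Callable[[List[int], int], float]
) -> Tuple[str, List[int]]:
    """Greedily pick the best-scoring voken among those the trie allows."""
    node, prefix = trie, []
    while LEAF not in node:
        allowed = [v for v in node if v != LEAF]
        best = max(allowed, key=lambda v: score_next(prefix, v))
        prefix.append(best)
        node = node[best]
    return node[LEAF], prefix


def make_toy_scorer(preferred: Sequence[int]) -> Callable[[List[int], int], float]:
    """Stand-in for a text-conditioned model: favors one voken per position."""
    def score_next(prefix: List[int], voken: int) -> float:
        pos = len(prefix)
        return 1.0 if pos < len(preferred) and preferred[pos] == voken else 0.0
    return score_next


trie = build_trie(IMAGE_VOKENS)
print(constrained_greedy_decode(trie, make_toy_scorer((3, 1, 5)))[0])  # img_dog
# The scorer's favorite last voken (9) matches no image, so the trie forces a valid one.
print(constrained_greedy_decode(trie, make_toy_scorer((2, 7, 9)))[0])  # img_car
```

In a real system the scorer would be replaced by next-token probabilities from a model conditioned on the text query, and greedy search would typically give way to beam search to return a ranked list.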
Related papers
- Cross-modal RAG: Sub-dimensional Text-to-image Retrieval-augmented Generation (2025) 0.00
- Learning To Tokenize For Generative Retrieval (2023) 4.52
- Tiger: Unifying Text-to-image Generation And Retrieval With Large Multimodal Models (2024) 0.00
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017) 18.52
- Cost: Contrastive Quantization Based Semantic Tokenization For Generative Recommendation (2024) 7.81
- AR-RAG: Autoregressive Retrieval Augmentation For Image Generation (2025) 0.00
- Text-to-image Generation Via Implicit Visual Guidance And Hypernetwork (2022) 0.00
- Generative Recall, Dense Reranking: Learning Multi-view Semantic Ids For Efficient Text-to-video Retrieval (2026) 0.00