MVAM: Multi-view Attention Method For Fine-grained Image-text Matching
2024 · Wanqing Cui, Rui Cheng, Jiafeng Guo, et al.
Abstract
Existing two-stream models, such as CLIP, encode images and text through independent representations. Because they perform well while preserving retrieval speed, they have attracted attention from both industry and academia. However, a single representation often struggles to capture complex content fully, so such models may ignore fine-grained information during matching, resulting in suboptimal retrieval results. To overcome this limitation and enhance the performance of two-stream models, we propose a Multi-view Attention Method (MVAM) for image-text matching. This approach leverages diverse attention heads with unique view codes to learn multiple representations for images and text, which are then concatenated for matching. We also incorporate a diversity objective to explicitly encourage attention heads to focus on distinct aspects of the input data, capturing complementary fine-grained details. This diversity enables the model to represent image-text pairs from multiple perspectives, ensuring
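The mechanism described in the abstract (several attention heads, each driven by its own view code, with the resulting representations concatenated for matching, plus a diversity objective) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: view codes are treated as query vectors for scaled dot-product attention pooling, the diversity term is taken to be the mean pairwise overlap between attention distributions, and the names `multi_view_pool`, `diversity_penalty`, and `cosine` are hypothetical. Random vectors stand in for learned parameters and encoder outputs.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_view_pool(tokens, view_codes):
    """Pool n token vectors (n x d) once per view code (k x d).

    Assumption: each view code acts as an attention query.
    Returns the concatenated embedding (length k * d) and the
    attention weights (k x n).
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    concatenated, attn = [], []
    for code in view_codes:
        weights = softmax([dot(code, t) / scale for t in tokens])
        pooled = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
        attn.append(weights)
        concatenated.extend(pooled)
    return concatenated, attn


def diversity_penalty(attn):
    """Assumed diversity term: mean overlap between pairs of views.

    Lower values mean the views attend to more distinct tokens.
    """
    k = len(attn)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if not pairs:
        return 0.0
    return sum(dot(attn[i], attn[j]) for i, j in pairs) / len(pairs)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy usage: random stand-ins for encoder token features and view codes.
rng = random.Random(0)
d, k = 8, 3
image_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]
text_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(4)]
image_codes = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
text_codes = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]

img_vec, img_attn = multi_view_pool(image_tokens, image_codes)
txt_vec, txt_attn = multi_view_pool(text_tokens, text_codes)
match_score = cosine(img_vec, txt_vec)  # one similarity over all views
penalty = diversity_penalty(img_attn) + diversity_penalty(txt_attn)
```

Because each side is reduced to one fixed-length vector before comparison, matching remains a single similarity computation per image-text pair, which is what keeps a two-stream design fast at retrieval time. In training, a term like `penalty` would be added to the matching loss; how the paper weights or defines it is not stated in the abstract.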
Related papers
- Improving Video-text Retrieval By Multi-stream Corpus Alignment And Dual Softmax Loss (2021)
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022)
- MV-Adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023)
- VISTA: Visualized Text Embedding For Universal Multi-modal Retrieval (2024)
- A New Fine-grained Alignment Method For Image-text Matching (2023)
- MCAD: Multi-teacher Cross-modal Alignment Distillation For Efficient Image-text Retrieval (2023)
- AMC: Attention Guided Multi-modal Correlation Learning For Image Search (2017)
- Exploring A Fine-grained Multiscale Method For Cross-modal Remote Sensing Image Retrieval (2022)