Dynamic Visual Semantic Sub-embeddings And Fast Re-ranking
2023 · Wenzhang Wei, Zhipeng Gui, Changguang Wu, et al.
Abstract
The core of cross-modal matching is to accurately measure the similarity between different modalities in a unified representation space. However, compared to textual descriptions of a certain perspective, the visual modality has more semantic variations. So, images are usually associated with multiple textual captions in databases. Although popular symmetric embedding methods have explored numerous modal interaction approaches, they often learn toward increasing the average expression probability of multiple semantic variations within image embeddings. Consequently, information entropy in embeddings is increased, resulting in redundancy and decreased accuracy. In this work, we propose a Dynamic Visual Semantic Sub-Embeddings framework (DVSE) to reduce the information entropy. Specifically, we obtain a set of heterogeneous visual sub-embeddings through dynamic orthogonal constraint loss. To encourage the generated candidate embeddings to capture various semantic variations, we construct
Related papers
- Univse: Robust Visual Semantic Embeddings Via Structured Semantic Representations (2019) 0.00
- Improving Visual-semantic Embeddings By Learning Semantically-enhanced Hard Negatives For Cross-modal Information Retrieval (2022) 9.41
- Maximal Matching Matters: Preventing Representation Collapse For Robust Cross-modal Retrieval (2025) 2.26
- Generalized Multi-view Embedding For Visual Recognition And Cross-modal Retrieval (2016) 14.69
- Multiple Visual-semantic Embedding For Video Retrieval From Query Sentence (2020) 2.26
- Improving Cross-modal Retrieval With Set Of Diverse Embeddings (2022) 13.55
- Vldeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) 10.21
- Preserving Semantic Neighborhoods For Robust Cross-modal Retrieval (2020) 10.07