Abstract

The core problem of text-based person retrieval is how to bridge the heterogeneous gap between multi-modal data. Many previous approaches learn a latent common manifold mapping in a cross-modal distribution consensus prediction (CDCP) manner. When features from the distribution of one modality are mapped into the common manifold, the feature distribution of the opposite modality is completely invisible. That is, how to reach a cross-modal distribution consensus, so as to embed and align the multi-modal features in a constructed cross-modal common manifold, depends entirely on the experience of the model itself rather than on the actual situation. With such methods, the multi-modal data inevitably cannot be well aligned in the common manifold, which ultimately leads to sub-optimal retrieval performance. To overcome this CDCP dilemma, we propose a novel algorithm termed LBUL to learn a Consistent Cross-modal Common Manifold (C\(

Authors

(none)

Tags

  • Cross-Modal Hashing
  • Image Retrieval

Stats

  • citations: 110
  • S2 citations: -
  • github stars: 0
  • HF likes: 0
  • heat score: 15.34
  • arxiv key: wang2022look

Related papers