Abstract

In cross-modal retrieval, seamlessly integrating diverse modalities within multimedia remains a formidable challenge, especially given the complexities introduced by noisy correspondence learning (NCL). Such noise often stems from mismatched data pairs, a significant obstacle distinct from traditional noisy labels. This paper introduces the Pseudo-Classification based Pseudo-Captioning (PC\(^2\)) framework to address this challenge. PC\(^2\) offers a threefold strategy. First, it establishes an auxiliary "pseudo-classification" task that interprets captions as categorical labels, steering the model to learn image-text semantic similarity through a non-contrastive mechanism. Second, unlike prevailing margin-based techniques, we capitalize on PC\(^2\)'s pseudo-classification capability to generate pseudo-captions that provide more informative and tangible supervision for each mismatched pair. Third, the oscillation of pseudo-classification is borrowed to assist t

Authors

(none)

Tags

  • Cross-Modal Hashing
  • Image Retrieval

Stats

  • citations: 7
  • S2 citations: n/a
  • github stars: 16
  • HF likes: 0
  • heat score: 9.23
  • arxiv key: duan2024pc

Related papers