Abstract

Cross-modal retrieval relies on well-matched large-scale datasets, which are laborious to collect in practice. Recently, to alleviate expensive data collection, co-occurring pairs have been automatically harvested from the Internet for training. However, this inevitably introduces mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage the memorization effect of deep neural networks to address noisy correspondences, but they overconfidently focus on *similarity-guided training with hard negatives* and suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). Specifically, by viewing sample matching as classification tasks within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and
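
The batch-as-classification view described in the abstract can be illustrated with a small sketch. This is not SREM's implementation. The cosine similarity, the temperature of 0.07, the function names, the toy one-hot embeddings, and the use of free energy (the negative log-sum-exp of the logits) as the uncertainty signal are all assumptions made for illustration. Each query is scored against every candidate in the batch, and a row whose logits are uniformly low gets a high energy, which hints that no candidate really matches.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def batch_logits(queries, candidates, temperature=0.07):
    """Row i holds the logits of query i over all candidates in the batch.

    The temperature value is an assumed hyperparameter, not one from the paper.
    """
    return [[cosine(q, c) / temperature for c in candidates] for q in queries]


def energy(row):
    """Free energy -logsumexp(row), computed stably; higher means less certain."""
    m = max(row)
    return -(m + math.log(sum(math.exp(x - m) for x in row)))


# Toy batch: four "image" embeddings and their paired "text" embeddings.
images = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
texts = [row[:] for row in images]
texts[2] = [0.0, 0.0, 0.0, 0.0, 1.0]  # pair 2 is deliberately mismatched

logits = batch_logits(images, texts)
energies = [energy(row) for row in logits]

# The pair with the highest energy is the most suspicious one.
suspect = max(range(len(energies)), key=energies.__getitem__)
print(suspect)
```

In this toy batch the clean pairs each have one dominant logit, so their energy is strongly negative, while the mismatched pair has no dominant logit and stands out. A single similarity score for the pair alone would not capture how the query relates to the rest of the batch.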

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations: 11
  • S2 citations: -
  • GitHub stars: 0
  • HF likes: 0
  • heat score: 8.09
  • arXiv key: dang2023noisy

Related papers