Memory Enhanced Embedding Learning For Cross-modal Video-text Retrieval
2021 · Rui Zhao, Kecheng Zheng, Zheng-Jun Zha, et al.
Abstract
Cross-modal video-text retrieval, a challenging task in the field of vision and language, aims at retrieving the corresponding instance given a sample from either modality. Existing approaches for this task all focus on designing the encoding model, trained with a hard negative ranking loss, and leave two key problems unaddressed. First, in the training stage, only a mini-batch of instance pairs is available in each iteration. Hard negatives are therefore mined locally inside a mini-batch, ignoring the global negative samples across the dataset. Second, there are many text descriptions for one video, and each text describes only certain local features of that video. Previous works for this task did not consider fusing the multiple texts corresponding to a video during training. In this paper, to solve the above two problems, we propose a novel memory enhanced embedding learning (MEEL) method for video-text retrieval. To be specific, we construct two kinds of m
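The in-batch hard negative ranking loss that the abstract refers to can be sketched as follows. This is a minimal illustration of the general technique (a hinge loss on the hardest negative in the mini-batch, in both retrieval directions), not the MEEL method itself. The function name, the margin value, and the similarity matrix are assumptions made for the example.

```python
def hardest_negative_ranking_loss(sim, margin=0.2):
    """Bidirectional hinge loss using the hardest in-batch negative.

    sim[i][j] is the similarity between video i and text j; matched
    pairs sit on the diagonal. Requires a batch of at least two pairs.
    The margin of 0.2 is an assumed value, not taken from the paper.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative text for video i (row i, off-diagonal).
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative video for text i (column i, off-diagonal).
        hardest_video = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_video - pos)
    return total / n


# Hypothetical similarities for a mini-batch of three video-text pairs.
sim = [
    [0.90, 0.30, 0.10],
    [0.20, 0.80, 0.75],
    [0.40, 0.10, 0.70],
]
loss = hardest_negative_ranking_loss(sim)
```

Because the `max` runs only over the rows and columns of the current mini-batch, any harder negative elsewhere in the dataset is invisible to this loss, which is the first limitation the abstract describes.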
Related papers
- Learning A Text-video Embedding From Incomplete And Heterogeneous Data (2018) 4.18
- Dual Encoding For Video Retrieval By Text (2020) 16.05
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026) 0.00
- Stacked Convolutional Deep Encoding Network For Video-text Retrieval (2020) 7.81
- Dual-modal Attention-enhanced Text-video Retrieval With Triplet Partial Margin Contrastive Learning (2023) 8.82
- Modality-balanced Embedding For Video Retrieval (2022) 7.16
- Embedding-based Retrieval In Multimodal Content Moderation (2025) 2.26
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00