DiDeMo

Emerging

11 papers using it

First seen in 2021

About

DiDeMo contains 10K long-form videos from Flickr. Each video is annotated with about four short sentences in temporal order. Following existing work, we concatenate those short sentences and evaluate ‘paragraph-to-video’ retrieval on this benchmark. We adopt the official split: Train: 8,395 videos, 8,395 captions (c
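The paragraph-to-video protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark's official tooling: the `(video_id, start_time, sentence)` tuple layout, the function names, and the similarity scores are all hypothetical, and Recall@K is shown only as one common retrieval metric.

```python
from collections import defaultdict


def build_paragraphs(annotations):
    """Join each video's sentences, in temporal order, into one paragraph query.

    `annotations` is an assumed list of (video_id, start_time, sentence) tuples.
    """
    by_video = defaultdict(list)
    for video_id, start_time, sentence in annotations:
        by_video[video_id].append((start_time, sentence.strip()))
    return {
        video_id: " ".join(s for _, s in sorted(items, key=lambda item: item[0]))
        for video_id, items in by_video.items()
    }


def recall_at_k(similarity, k):
    """Fraction of paragraph queries whose own video ranks in the top k.

    similarity[i][j] scores paragraph i against video j; the ground-truth
    match for paragraph i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of videos scored strictly higher than the true video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy data: two videos, sentences given out of temporal order for "v1".
annotations = [
    ("v1", 5, "A dog runs outside."),
    ("v1", 0, "A man opens a door."),
    ("v2", 0, "Waves hit the shore."),
]
paragraphs = build_paragraphs(annotations)

# Made-up paragraph-to-video scores; row 1 ranks the wrong video first.
similarity = [[0.9, 0.1], [0.8, 0.3]]
r_at_1 = recall_at_k(similarity, 1)
r_at_2 = recall_at_k(similarity, 2)
```

Because each video contributes exactly one concatenated paragraph, the number of queries equals the number of videos, which matches the one-caption-per-video counts in the split above.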


Papers using DiDeMo (11)

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding · 2026

Bima: Towards Biases Mitigation For Text-video Retrieval Via Scene Element Guidance · 2025 · 1 cite

Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval · 2026

From Captions To Keyframes: Keyscore For Multimodal Frame Scoring And Video-language Understanding · 2025

GAIS: Frame-level Gated Audio-visual Integration With Semantic Variance-scaled Perturbation For Text-video Retrieval · 2025

Weakly Supervised Temporal Adjacent Network for Language Grounding · 2021 · 87 cites

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment · 2022 · 53 cites

Cross-Modal Adapter for Vision-Language Retrieval · 2022 · 17 cites

MuMUR : Multilingual Multimodal Universal Retrieval · 2022

Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval · 2023

Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding · 2023