MSR-VTT
Canonical
12 papers using it
First seen: 2023
Papers using MSR-VTT (12)
- PEEK: Picking Essential frames via Efficient Knowledge distillation
- InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
- Bima: Towards Biases Mitigation For Text-video Retrieval Via Scene Element Guidance
- ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
- Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
- From Captions To Keyframes: Keyscore For Multimodal Frame Scoring And Video-language Understanding
- GAIS: Frame-level Gated Audio-visual Integration With Semantic Variance-scaled Perturbation For Text-video Retrieval
- Distilling Vision-Language Models on Millions of Videos
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling
- Linear Alignment of Vision-language Models for Image Captioning
- Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
- E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer