Towards Holistic Language-video Representation: The Language Model-enhanced MSR-Video to Text Dataset
2024 · Yuchen Yang, Yingxuan Duan
Abstract
A more robust and holistic language-video representation is key to advancing video understanding. Although training strategies have improved, the quality of language-video datasets has received less attention. Current text descriptions are plain and simple, and language-video tasks focus only on visual content, which limits performance in real-world natural-language video retrieval, where queries are much more complex. This paper introduces a method to automatically enhance video-language datasets, making them more modality- and context-aware to support more sophisticated representation learning and thereby benefit downstream tasks. Our multifaceted video captioning method captures entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed text-side information that correlates with the video side for training. We also develop an agent-like strategy using language models to generate high-quality, factual textual descriptions, reducing
Related papers
- Learning Language-visual Embedding For Movie Understanding With Natural-language (2016) 0.00
- Leveraging Generative Language Models For Weakly Supervised Sentence Component Analysis In Video-language Joint Learning (2023) 0.00
- Distilling Vision-language Models On Millions Of Videos (2024) 7.50
- Context-enhanced Video Moment Retrieval With Large Language Models (2024) 5.84
- Narrating The Video: Boosting Text-video Retrieval Via Comprehensive Utilization Of Frame-level Captions (2025) 6.77
- MDMMT-2: Multidomain Multimodal Transformer For Video Retrieval, One More Step Towards Generalization (2022) 0.00
- Panda-70M: Captioning 70M Videos With Multiple Cross-modality Teachers (2024) 15.54
- Robustness Analysis Of Video-language Models Against Visual And Language Perturbations (2022) 5.24