Towards Balanced Alignment: Modal-enhanced Semantic Modeling For Video Moment Retrieval
2023 · Zhihang Liu, Jun Li, Hongtao Xie, et al.
Abstract
Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, existing strategies are often sub-optimal because they ignore the modality imbalance problem, i.e., the semantic richness inherent in videos far exceeds that of a given limited-length sentence. Therefore, in pursuit of better alignment, a natural idea is to enhance the video modality to filter out query-irrelevant semantics, and to enhance the text modality to capture more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions associated with query words in frame-level features while suppressing irrelevant parts. Therefore, the enhanced vid
Related papers
- Modality-balanced Embedding For Video Retrieval (2022) 7.16
- Context-enhanced Video Moment Retrieval With Large Language Models (2024) 5.84
- Vlanet: Video-language Alignment Network For Weakly-supervised Video Moment Retrieval (2020) 13.28
- Frame-wise Cross-modal Matching For Video Moment Retrieval (2020) 13.17
- Boosting Video-text Retrieval With Explicit High-level Semantics (2022) 7.50
- Semantic Video Moments Retrieval At Scale: A New Task And A Baseline (2022) 0.00
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) 0.00
- Hybrid-learning Video Moment Retrieval Across Multi-domain Labels (2024) 0.00