Query-centric Audio-visual Cognition Network For Moment Retrieval, Segmentation And Step-captioning
2024 · Yunbin Tu, Liang Li, Li Su, et al.
Abstract
Video has emerged as a favored multimedia format on the internet. To help users better access video content, a new topic, HIREST, has been presented, covering video retrieval, moment retrieval, moment segmentation, and step-captioning. The pioneering work uses a pre-trained CLIP-based model for video retrieval and leverages it as a feature extractor for the other three challenging tasks, which are solved in a multi-task learning paradigm. Nevertheless, this work struggles to learn a comprehensive cognition of user-preferred content because it disregards the hierarchies and association relations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation, and step-captioning. Specifically, we first design the modality-synergistic perception to obtain rich audio-visual content by modeling global contrastive alignment and local fine-grained interaction
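As a rough illustration of what "global contrastive alignment" between audio and visual features can look like, the sketch below computes a generic symmetric InfoNCE loss over paired clip-level embeddings. It is not the QUAG implementation: the function name `symmetric_info_nce`, the temperature value, and the use of one pooled embedding per clip are all assumptions made for this example.

```python
import numpy as np


def symmetric_info_nce(audio, visual, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not from the paper).

    audio, visual: arrays of shape (batch, dim); row i of each is a matched pair.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diagonal(z):
        # Numerically stable row-wise log-softmax; the targets are the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))
```

The loss is small when each audio embedding is closest to its own visual embedding and grows when the pairs are mismatched, which is the behaviour a global alignment objective relies on.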
Related papers
- Hierarchical Video-moment Retrieval And Step-captioning (2023) 12.54
- Towards Efficient And Robust Moment Retrieval System: A Unified Framework For Multi-granularity Models And Temporal Reranking (2025) 2.26
- Unified Interactive Multimodal Moment Retrieval Via Cascaded Embedding-reranking And Temporal-aware Score Fusion (2025) 0.00
- HVD: Human Vision-driven Video Representation Learning For Text-video Retrieval (2026) 0.00
- Clamr: Contextualized Late-interaction For Multimodal Content Retrieval (2025) 0.00
- Audio Does Matter: Importance-aware Multi-granularity Fusion For Video Moment Retrieval (2025) 4.49
- Viseret: A Simple Yet Effective Approach To Moment Retrieval Via Fine-grained Video Segmentation (2021) 0.00
- Contextiq: A Multimodal Expert-based Video Retrieval System For Contextual Advertising (2024) 0.00