Query-centric Audio-visual Cognition Network For Moment Retrieval, Segmentation And Step-captioning
2024 · Yunbin Tu, Liang Li, Li Su, et al.
Abstract
Video has emerged as a favored multimedia format on the internet. To help users better grasp video content, a new task, HIREST, has been introduced, comprising video retrieval, moment retrieval, moment segmentation, and step-captioning. The pioneering work uses a pre-trained CLIP-based model for video retrieval and leverages it as a feature extractor for the other three challenging tasks, which are solved in a multi-task learning paradigm. Nevertheless, that work struggles to learn a comprehensive cognition of user-preferred content because it disregards the hierarchies and association relations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation, and step-captioning. Specifically, we first design modality-synergistic perception to obtain rich audio-visual content by modeling global contrastive alignment and local fine-grained interaction
Related papers
- Watch, Listen, And Describe: Globally And Locally Aligned Cross-modal Attentions For Video Captioning (2018)
- SMART: Shot-aware Multimodal Video Moment Retrieval With Audio-enhanced MLLM (2025)
- Learning Audio-video Modalities From Image Captions (2022)
- Sequential Contrastive Audio-visual Learning (2024)
- Listen, Look And Deliberate: Visual Context-aware Speech Recognition Using Pre-trained Text-video Representations (2020)
- Enhancing Retrieval-augmented Audio Captioning With Generation-assisted Multimodal Querying And Progressive Learning (2024)
- Improving Audio-text Retrieval Via Hierarchical Cross-modal Interaction And Auxiliary Captions (2023)
- AVLnet: Learning Audio-visual Language Representations From Instructional Videos (2020)