GMS-CAVP: Improving Audio-video Correspondence With Multi-scale Contrastive And Generative Pretraining
2026 · Shentong Mo, Zehua Chen, Jun Zhu
Abstract
Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which existing frameworks underutilize. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond tradi
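As a rough illustration of what a multi-scale contrastive objective can look like, the sketch below averages a symmetric InfoNCE loss over several temporal pooling windows. It is not the GMS-CAVP implementation: the function names, the window sizes `(1, 2, 4)`, the temperature, and the choice to treat every pooled segment in the batch as one contrastive sample are all assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(z, axis):
    # Numerically stable log-softmax.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def info_nce(v, a, temperature=0.07):
    # Symmetric InfoNCE over N paired embeddings of shape (N, D).
    # Row i of `v` and row i of `a` form the positive pair.
    logits = l2_normalize(v) @ l2_normalize(a).T / temperature
    idx = np.arange(len(v))
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2a + loss_a2v)

def pool_time(x, window):
    # Average-pool (B, T, D) features over non-overlapping windows.
    b, t, d = x.shape
    t_trim = (t // window) * window
    return x[:, :t_trim].reshape(b, t_trim // window, window, d).mean(axis=2)

def multi_scale_contrastive_loss(video, audio, windows=(1, 2, 4)):
    # Hypothetical scales: window=1 is the finest, larger windows are coarser.
    losses = []
    for w in windows:
        v = pool_time(video, w)
        a = pool_time(audio, w)
        # Flatten (clip, segment) pairs: other segments of the same clip act
        # as temporal negatives, segments of other clips as semantic negatives.
        losses.append(info_nce(v.reshape(-1, v.shape[-1]),
                               a.reshape(-1, a.shape[-1])))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8, 16))                       # (batch, time, dim)
aligned_audio = video + 0.01 * rng.normal(size=video.shape)
random_audio = rng.normal(size=video.shape)

loss_aligned = multi_scale_contrastive_loss(video, aligned_audio)
loss_random = multi_scale_contrastive_loss(video, random_audio)
print(loss_aligned < loss_random)
```

In this toy setup, audio features that track the video features should yield a lower loss than unrelated ones. A real system would compute the features with learned video and audio encoders and might weight the scales differently.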
Related papers
- Contrastive Audio-visual Masked Autoencoder (2022) 4.93
- DiffGAP: A Lightweight Diffusion Module In Contrastive Space For Bridging Cross-model Gap (2025) 3.58
- Sequential Contrastive Audio-visual Learning (2024) 5.84
- Siamese Vision Transformers Are Scalable Audio-visual Learners (2024) 7.47
- Quality Over Quantity? LLM-based Curation For A Data-efficient Audio-video Foundation Model (2025) 0.00
- TC-MGC: Text-conditioned Multi-grained Contrastive Learning For Text-video Retrieval (2025) 6.93
- CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language (2023) 0.00
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) 0.00