Multiscale Matching Driven By Cross-modal Similarity Consistency For Audio-text Retrieval
2024 · Qian Wang, Jia-Chen Gu, Zhen-Hua Ling
Abstract
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pi
Related papers
- Improving Audio-text Retrieval Via Hierarchical Cross-modal Interaction And Auxiliary Captions (2023) 0.00
- Contrastive Latent Space Reconstruction Learning For Audio-text Retrieval (2023) 3.58
- ASK: Adaptive Self-improving Knowledge Framework For Audio Text Retrieval (2025) 0.00
- From Contrast To Commonality: Audio Commonality Captioning For Enhanced Audio-text Cross-modal Understanding In Multimodal LLMs (2025) 0.00
- Complete Cross-triplet Loss In Label Space For Audio-visual Cross-modal Retrieval (2022) 5.84
- Killing Two Birds With One Stone: Can An Audio Captioning System Also Be Used For Audio-text Retrieval? (2023) 0.00
- Estimated Audio-caption Correspondences Improve Language-based Audio Retrieval (2024) 0.00
- Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation (2018) 14.19