Multiscale Matching Driven By Cross-modal Similarity Consistency For Audio-text Retrieval
2024 · Qian Wang, Jia-Chen Gu, Zhen-Hua Ling
Abstract
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pi
Related papers
- Estimated Audio-caption Correspondences Improve Language-based Audio Retrieval (2024)
- Video And Audio Are Images: A Cross-modal Mixer For Original Data On Video-audio Retrieval (2023)
- Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation (2018)
- Transcending Fusion: A Multi-scale Alignment Method For Remote Sensing Image-text Retrieval (2024)
- Deep Triplet Neural Networks With Cluster-cca For Audio-visual Cross-modal Retrieval (2019)
- Adversarial Cross-modal Retrieval Via Learning And Transferring Single-modal Similarities (2019)
- Maximal Matching Matters: Preventing Representation Collapse For Robust Cross-modal Retrieval (2025)
- Exploring A Fine-grained Multiscale Method For Cross-modal Remote Sensing Image Retrieval (2022)