Improving Audio-text Retrieval Via Hierarchical Cross-modal Interaction And Auxiliary Captions
2023 · Yifei Xin, Yuexian Zou
Abstract
Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between whole audio clips and complete caption sentences, ignoring fine-grained cross-modal relationships such as those between short segments and phrases or between frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR that simultaneously explores clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. We also present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner to perform feature interaction between the audio and the generated captions; this yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions can also form new audio-text pairs that serve as data augmentation for training. Experiments show that HCI significantly improves ATR performance. Moreover, our AC framework also shows
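As a rough illustration of scoring an audio-text pair at the three granularities named in the abstract, here is a minimal sketch in plain Python. Everything beyond the three level names is an assumption made for illustration and is not taken from the paper: segments and phrases are formed by mean-pooling fixed-size chunks, fine-grained scores use a "best audio match per text unit, then average" rule, and the three levels are weighted equally. The helper names (`mean_pool`, `chunk`, `max_mean_similarity`, `hierarchical_score`) are hypothetical.

```python
import math

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def chunk(vectors, size):
    """Assumed grouping: mean-pool consecutive chunks of `size` vectors."""
    return [mean_pool(vectors[i:i + size]) for i in range(0, len(vectors), size)]

def max_mean_similarity(audio_units, text_units):
    """Assumed aggregation: best audio match per text unit, then average."""
    best = [max(cosine(a, t) for a in audio_units) for t in text_units]
    return sum(best) / len(best)

def hierarchical_score(frames, words, seg_size=2, phrase_size=2):
    """Equal-weight average of clip-sentence, segment-phrase, frame-word scores."""
    clip_sentence = cosine(mean_pool(frames), mean_pool(words))
    segment_phrase = max_mean_similarity(chunk(frames, seg_size),
                                         chunk(words, phrase_size))
    frame_word = max_mean_similarity(frames, words)
    return (clip_sentence + segment_phrase + frame_word) / 3.0

# Toy embeddings (made up): four audio frames, two candidate captions.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
mismatched = [[-1.0, 0.0], [0.0, -1.0]]
print(hierarchical_score(frames, matched) > hierarchical_score(frames, mismatched))
```

In a trained system the per-level scores would come from learned encoders and feed a contrastive loss; the toy vectors here only show that a caption aligned at all three levels scores higher than one that is not.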
Related papers
- Multiscale Matching Driven By Cross-modal Similarity Consistency For Audio-text Retrieval (2024) · 4.52
- Killing Two Birds With One Stone: Can An Audio Captioning System Also Be Used For Audio-text Retrieval? (2023) · 0.00
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022) · 0.00
- Watch, Listen, And Describe: Globally And Locally Aligned Cross-modal Attentions For Video Captioning (2018) · 12.87
- Enhancing Retrieval-augmented Audio Captioning With Generation-assisted Multimodal Querying And Progressive Learning (2024) · 3.58
- Advancing Natural-language Based Audio Retrieval With Passt And Large Audio-caption Data Sets (2023) · 0.00
- From Contrast To Commonality: Audio Commonality Captioning For Enhanced Audio-text Cross-modal Understanding In Multimodal Llms (2025) · 0.00
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023) · 8.82