Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning
2022 · Chen Chen, Nana Hou, Yuchen Hu, et al.
Abstract
Automated audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio. Most prior works extract single-modality acoustic features and are therefore sub-optimal for the cross-modal decoding task. In this work, we propose a novel AAC system called CLIP-AAC to learn interactive cross-modality representation with both acoustic and textual information. Specifically, the proposed CLIP-AAC introduces an audio-head and a text-head in the pre-trained encoder to extract audio-text information. Furthermore, we apply contrastive learning to narrow the domain difference by learning the correspondence between the audio signal and its paired captions. Experimental results show that the proposed CLIP-AAC approach surpasses the best baseline by a significant margin on the Clotho dataset in terms of NLP evaluation metrics. The ablation study indicates that both the pre-trained model and contrastive learning contribute to the performance…
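The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice: a CLIP-style symmetric InfoNCE loss over a batch of paired audio and caption embeddings. The function names, the temperature value, and the toy embeddings are all assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Hypothetical CLIP-style loss: audio-to-text plus text-to-audio, averaged.

    audio_emb[i] and text_emb[i] are assumed to come from the same
    audio/caption pair; every other combination is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy embeddings (made up): each caption vector is close to its paired audio vector.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = symmetric_contrastive_loss(audio, text)
# Rotating the captions breaks every pairing, so the loss should increase.
misaligned = symmetric_contrastive_loss(audio, text[1:] + text[:1])
```

Under this assumed objective, minimizing the loss pulls each audio embedding toward its own caption and away from the other captions in the batch, which is one way to read the abstract's "learning the correspondence between the audio signal and its paired captions".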
Related papers
- EnCLAP: Combining Neural Audio Codec And Audio-text Joint Embedding For Automated Audio Captioning (2024)
- SLAM-AAC: Enhancing Audio Captioning With Paraphrasing Augmentation And CLAP-Refine Through LLMs (2024)
- CLAIR-A: Leveraging Large Language Models To Judge Audio Captions (2024)
- DRCap: Decoding CLAP Latents With Retrieval-augmented Generation For Zero-shot Audio Captioning (2024)
- Beyond The Status Quo: A Contemporary Survey Of Advances And Challenges In Audio Captioning (2022)
- Investigating Local And Global Information For Automated Audio Captioning With Transfer Learning (2021)
- From Contrast To Commonality: Audio Commonality Captioning For Enhanced Audio-text Cross-modal Understanding In Multimodal LLMs (2025)
- Improving Audio-text Retrieval Via Hierarchical Cross-modal Interaction And Auxiliary Captions (2023)