Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation
2023 Β· Shih-Lun Wu, Xuankai Chang, Gordon Wichern, et al.
Abstract
Automated audio captioning (AAC) aims to generate informative descriptions for various sounds from nature and/or human activities. In recent years, AAC has quickly attracted research interest, with state-of-the-art systems now relying on a sequence-to-sequence (seq2seq) backbone powered by strong models such as Transformers. Following the macro-trend of applied machine learning research, in this work, we strive to improve the performance of seq2seq AAC models by extensively leveraging pretrained models and large language models (LLMs). Specifically, we utilize BEATs to extract fine-grained audio features. Then, we employ Instructor LLM to fetch text embeddings of captions, and infuse their language-modality knowledge into BEATs audio features via an auxiliary InfoNCE loss function. Moreover, we propose a novel data augmentation method that uses ChatGPT to produce caption mix-ups (i.e., grammatical and compact combinations of two captions) which, together with the corresponding audio mi
Authors
(none)
Tags
Stats
Related papers
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024)5.24
- SLAM-AAC: Enhancing Audio Captioning With Paraphrasing Augmentation And Clap-refine Through Llms (2024)0.00
- Improving The Performance Of Automated Audio Captioning Via Integrating The Acoustic And Semantic Information (2021)2.00
- Beyond The Status Quo: A Contemporary Survey Of Advances And Challenges In Audio Captioning (2022)9.03
- Investigating Local And Global Information For Automated Audio Captioning With Transfer Learning (2021)0.00
- Evaluating Off-the-shelf Machine Listening And Natural Language Models For Automated Audio Captioning (2021)0.00
- Audiosetmix: Enhancing Audio-language Datasets With Llm-assisted Augmentations (2024)0.00
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021)0.00