Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding
2024 · Jizhong Liu, Gang Li, Junbo Zhang, et al.
Abstract
Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. We therefore explore enhancing AAC from three aspects: 1) an audio encoder pre-trained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and the text decoder are optimized with low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a SPIDEr-FL score of 33.0, outperforming the winner of DCASE 2023 Task 6A.
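The abstract states that both the audio encoder and the text decoder are adapted with LoRA but gives no hyperparameters. As a rough illustration of the general technique only (not the authors' implementation), the sketch below shows a LoRA-style linear layer in plain Python: the pre-trained weight W stays frozen, and a trainable low-rank update (alpha / r) · B · A is added to it. The class name `LoRALinear`, the helper `matvec`, and the values of `rank` and `alpha` are assumptions chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical LoRA-style layer: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen pre-trained weight, shape d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A and B would be updated during fine-tuning.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy example: a 2 x 4 frozen weight with a rank-1 adapter (illustrative values).
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 3.0]]
layer = LoRALinear(W, rank=1, alpha=2.0)
x = [1.0, 2.0, 3.0, 4.0]
out = layer.forward(x)
n_trainable = layer.trainable_params()
```

In this toy case the adapter has r · (d_in + d_out) = 6 trainable values against 8 in the frozen weight; the saving grows quickly for the large projection matrices found in a 7B-parameter decoder, which is the usual motivation for LoRA.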
Related papers
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023) 8.82
- SLAM-AAC: Enhancing Audio Captioning With Paraphrasing Augmentation And CLAP-Refine Through LLMs (2024) 0.00
- CLAIR-A: Leveraging Large Language Models To Judge Audio Captions (2024) 2.00
- Performance Improvement Of Language-queried Audio Source Separation Based On Caption Augmentation From Large Language Models For DCASE Challenge 2024 Task 9 (2024) 0.00
- Beyond The Status Quo: A Contemporary Survey Of Advances And Challenges In Audio Captioning (2022) 9.03
- Improving The Performance Of Automated Audio Captioning Via Integrating The Acoustic And Semantic Information (2021) 2.00
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021) 0.00
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024) 13.44