Zero-shot Audio Captioning Via Audibility Guidance
2023 · Tal Shaharabany, Ariel Shaulov, Lior Wolf
Abstract
The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) A Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) A model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) A text classifier, trained using a dataset we collected automatically by instructing GPT-4 with prompts designed to direct the generation of both audible and in
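The inference process described in the abstract can be pictured as scoring candidate text with three signals at once. The following is a minimal sketch of that idea and not the authors' implementation: the abstract does not say how the three scores are combined, so the weighted sum, the weights `w_faith` and `w_aud`, and all function names here are assumptions. The toy scorers are hypothetical stand-ins for GPT-2 (fluency), ImageBind (audio-text matching), and the audibility classifier.

```python
from typing import Callable, Sequence

# A scorer maps a candidate caption to a real-valued score (higher is better).
Scorer = Callable[[str], float]


def guided_score(text: str, fluency: Scorer, faithfulness: Scorer,
                 audibility: Scorer, w_faith: float = 1.0,
                 w_aud: float = 1.0) -> float:
    """Combine the three desiderata into one score.

    ASSUMPTION: a plain weighted sum; the source does not specify the rule.
    """
    return (fluency(text)
            + w_faith * faithfulness(text)
            + w_aud * audibility(text))


def pick_caption(candidates: Sequence[str], fluency: Scorer,
                 faithfulness: Scorer, audibility: Scorer,
                 w_faith: float = 1.0, w_aud: float = 1.0) -> str:
    """Return the candidate with the highest combined score."""
    return max(candidates,
               key=lambda c: guided_score(c, fluency, faithfulness,
                                          audibility, w_faith, w_aud))


# --- Hypothetical toy scorers, standing in for the real networks ---

def toy_fluency(text: str) -> float:
    # Stand-in for a language-model log-probability: shorter is "more likely".
    return -0.1 * len(text.split())


def toy_faithfulness(text: str) -> float:
    # Stand-in for an audio-text matching score; pretend the clip contains a dog.
    return 1.0 if "dog" in text.lower() else 0.0


def toy_audibility(text: str) -> float:
    # Stand-in for the audibility classifier: fraction of sound-related words.
    audible_words = {"barking", "loudly"}
    words = text.lower().split()
    return sum(w in audible_words for w in words) / len(words)


candidates = ["a brown dog", "a dog barking loudly", "a red car"]
best = pick_caption(candidates, toy_fluency, toy_faithfulness, toy_audibility)
print(best)
```

In this toy setup, "a brown dog" matches the clip but describes something visual, so the audibility term favors the candidate that describes a sound. In a real system the three scorers would be calls to the respective networks, and the guidance would more plausibly be applied during decoding rather than to a fixed candidate list.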
Related papers
- Zero-shot Audio Captioning With Audio-language Model Guidance And Audio Context Keywords (2023)
- Classifier-guided Captioning Across Modalities (2025)
- Audio Captioning Using Pre-trained Large-scale Language Model Guided By Audio-based Similar Caption Retrieval (2020)
- Automated Audio Captioning: An Overview Of Recent Progress And New Challenges (2022)
- Listen Carefully And Tell: An Audio Captioning System Based On Residual Learning And Gammatone Audio Representation (2020)
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023)
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation (2023)