Classifier-guided Captioning Across Modalities
2025 · Ariel Shaulov, Tal Shaharabany, Eitan Shaar, et al.
Abstract
Most current captioning systems use language models trained on data from specific settings, such as image-based captioning via Amazon Mechanical Turk, limiting their ability to generalize to other modality distributions and contexts. This limitation hinders performance in tasks like audio or video captioning, where different semantic cues are needed. Addressing this challenge is crucial for creating more adaptable and versatile captioning frameworks applicable across diverse real-world contexts. In this work, we introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning, where it is crucial to describe sounds and their sources. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. The classifier is trained on a dataset automatically generated by GPT-4, using tailored prompts specificall
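As a rough illustration of the two-component design described in the abstract, the sketch below shows one simple way a text classifier could steer a frozen captioner: candidate captions are re-ranked by adding a weighted classifier log-probability to the language model's log-probability. This is an assumption made for illustration only. The excerpt does not say how the guidance is actually applied, and the function names, the `weight` parameter, and the example scores are all hypothetical.

```python
import math


def guided_scores(lm_logprobs, classifier_probs, weight=1.0):
    """Combine frozen-LM log-probabilities with classifier log-probabilities.

    lm_logprobs: dict mapping candidate caption -> log-probability under the LM.
    classifier_probs: dict mapping candidate caption -> probability (0..1) that
        the text fits the target setting (e.g. describes something audible).
    weight: hypothetical guidance strength; 0.0 disables the classifier.
    """
    combined = {}
    for candidate, lm_lp in lm_logprobs.items():
        # Clamp to avoid log(0) when the classifier is fully confident.
        p = max(classifier_probs[candidate], 1e-9)
        combined[candidate] = lm_lp + weight * math.log(p)
    return combined


def pick_caption(lm_logprobs, classifier_probs, weight=1.0):
    """Return the candidate with the highest combined score."""
    scores = guided_scores(lm_logprobs, classifier_probs, weight)
    return max(scores, key=scores.get)


# Made-up scores: the LM prefers a visual caption, while the (hypothetical)
# audibility classifier prefers the caption that describes a sound.
lm_logprobs = {
    "a red car parked on a street": -4.0,
    "a car engine revving loudly": -5.0,
}
classifier_probs = {
    "a red car parked on a street": 0.05,
    "a car engine revving loudly": 0.90,
}

unguided = pick_caption(lm_logprobs, classifier_probs, weight=0.0)
guided = pick_caption(lm_logprobs, classifier_probs, weight=1.0)
```

With `weight=0.0` the language model's preference is returned unchanged; raising the weight lets the classifier shift the choice toward text that fits the target modality, while the captioner itself stays frozen.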
Related papers
- Zero-shot Audio Captioning Via Audibility Guidance (2023)
- Audio Captioning Using Pre-trained Large-scale Language Model Guided By Audio-based Similar Caption Retrieval (2020)
- Diverse Audio Captioning Via Adversarial Training (2021)
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023)
- Automated Audio Captioning: An Overview Of Recent Progress And New Challenges (2022)
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021)
- Listen Carefully And Tell: An Audio Captioning System Based On Residual Learning And Gammatone Audio Representation (2020)
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024)