Semantic-aware Confidence Calibration For Automated Audio Captioning
2025 · Lucas Dunker, Sai Akshay Menta, Snigdha Mohana Addepalli, et al.
Abstract
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text overlap. Experiments on Clotho v2 demonstrate that confidence-guided beam search with semantic evaluatio
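To make the calibration idea concrete, the following is a minimal sketch of how Expected Calibration Error could be computed when correctness is defined by a semantic similarity score rather than n-gram overlap. It is not the authors' implementation. The function name `semantic_ece`, the similarity threshold of 0.5, and the use of 10 equal-width bins are all assumptions made for illustration; the similarity scores are assumed to come from an external scorer such as CLAP or FENSE and to lie in [0, 1].

```python
from typing import Sequence


def semantic_ece(
    confidences: Sequence[float],
    similarities: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
    n_bins: int = 10,        # hypothetical bin count, not taken from the paper
) -> float:
    """Expected Calibration Error with semantically defined correctness.

    A caption counts as "correct" when its semantic similarity to the
    reference (e.g. a CLAP or FENSE score computed elsewhere) is at least
    `threshold`. Confidences are assumed to lie in [0, 1].
    """
    if len(confidences) != len(similarities):
        raise ValueError("confidences and similarities must have equal length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")

    conf_sum = [0.0] * n_bins
    correct_sum = [0.0] * n_bins
    count = [0] * n_bins

    for conf, sim in zip(confidences, similarities):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        correct_sum[idx] += 1.0 if sim >= threshold else 0.0
        count[idx] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue
        avg_conf = conf_sum[k] / count[k]
        accuracy = correct_sum[k] / count[k]
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (count[k] / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Two confident captions (one semantically wrong) and two unconfident
    # captions (one semantically right).
    print(semantic_ece([0.9, 0.9, 0.1, 0.1], [0.8, 0.2, 0.1, 0.9]))
```

Under this definition, a model that assigns high confidence to captions with low semantic similarity is penalized even if those captions share many n-grams with the reference, which is the behavior the abstract argues for.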
Related papers
- Resource-efficient Reference-free Evaluation Of Audio Captions (2024)
- CLAIR-A: Leveraging Large Language Models To Judge Audio Captions (2024)
- Accurate And Reliable Confidence Estimation Based On Non-autoregressive End-to-end Speech Recognition System (2023)
- Can Audio Captions Be Evaluated With Image Caption Metrics? (2021)
- Confidence Estimation For Attention-based Sequence-to-sequence Models For Speech Recognition (2020)
- CosyAudio: Improving Audio Generation With Confidence Scores And Synthetic Captions (2025)
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023)
- EnCLAP: Combining Neural Audio Codec And Audio-text Joint Embedding For Automated Audio Captioning (2024)