Can Audio Captions Be Evaluated With Image Caption Metrics?
2021 · Zelin Zhou, Zhiling Zhang, Xuenan Xu, et al.
Abstract
Automated audio captioning aims to generate textual descriptions for an audio clip. To evaluate the quality of generated audio captions, previous works directly adopt image captioning metrics such as SPICE and CIDEr without justifying their suitability in this new domain, which may mislead the development of advanced models. This problem remains unstudied because no human judgment datasets on caption quality exist. Therefore, we first construct two evaluation benchmarks, AudioCaps-Eval and Clotho-Eval. They are built with pairwise comparison instead of absolute rating to achieve better inter-annotator agreement. Current metrics are found to correlate poorly with human annotations on these datasets. To overcome their limitations, we propose a metric named FENSE, which combines the strength of Sentence-BERT in capturing similarity with a novel Error Detector that penalizes erroneous sentences for robustness. On the newly established benchmarks, FENSE outperforms current metrics
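The abstract describes two ideas that a short sketch can make concrete: a similarity score that is penalized when a caption is flagged as erroneous, and evaluation of a metric against pairwise human preferences. The Python below is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the two components are combined, so the `threshold` and `penalty` values, all function names, and the tie-handling rule are hypothetical. The embeddings stand in for Sentence-BERT outputs and `error_prob` stands in for the Error Detector's output; neither model is called here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def penalized_similarity(cand_emb, ref_embs, error_prob,
                         threshold=0.9, penalty=0.9):
    """Mean similarity to the references, scaled down if the caption
    is flagged as erroneous.

    threshold and penalty are assumed values for illustration only.
    """
    sim = sum(cosine(cand_emb, ref) for ref in ref_embs) / len(ref_embs)
    if error_prob > threshold:
        sim *= (1.0 - penalty)
    return sim


def pairwise_agreement(score_pairs, human_prefs):
    """Fraction of caption pairs where the metric prefers the same
    caption as the human annotators.

    score_pairs: list of (score_a, score_b) for captions A and B.
    human_prefs: list of "a" or "b", one per pair.
    Ties in metric score are counted as disagreements (an assumption).
    """
    correct = 0
    for (score_a, score_b), pref in zip(score_pairs, human_prefs):
        if score_a > score_b and pref == "a":
            correct += 1
        elif score_b > score_a and pref == "b":
            correct += 1
    return correct / len(human_prefs)
```

With real models, `cand_emb` and `ref_embs` would be sentence embeddings of the candidate and reference captions, and `error_prob` would come from a classifier trained to detect erroneous sentences. Pairwise agreement suits benchmarks built from pairwise comparisons because it needs only the human preference for each pair, not an absolute rating.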
Related papers
- Resource-efficient Reference-free Evaluation Of Audio Captions (2024) 0.00
- Investigations In Audio Captioning: Addressing Vocabulary Imbalance And Evaluating Suitability Of Language-centric Performance Metrics (2022) 0.00
- Semantic-aware Confidence Calibration For Automated Audio Captioning (2025) 0.00
- CLAIR-A: Leveraging Large Language Models To Judge Audio Captions (2024) 2.00
- Evaluating Off-the-shelf Machine Listening And Natural Language Models For Automated Audio Captioning (2021) 0.00
- Text-based Audio Retrieval By Learning From Similarities Between Audio Captions (2024) 2.26
- Audio Caption: Listen And Tell (2019) 10.97
- Text-to-audio Grounding Based Novel Metric For Evaluating Audio Caption Similarity (2022) 0.00