Audio Caption In A Car Setting With A Sentence-level Loss
2019 · Xuenan Xu, Heinrich Dinkel, Mengyue Wu, et al.
Abstract
Captioning has attracted much attention in image and video understanding, whereas only a small amount of work has examined audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning in a car scene. A sentence-level loss is proposed for use in tandem with a GRU encoder-decoder model to generate captions with higher semantic similarity to human annotations. We evaluate the model on the newly proposed Car dataset, a previously published Mandarin Hospital dataset, and the Joint dataset, indicating its generalization capability across different scenes. An improvement can be observed in all metrics, including classical natural language generation (NLG) metrics, sentence richness, and human evaluation ratings. However, although detailed audio captions can now be generated automatically, human annotations still outperform model captions in many aspects.
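The abstract does not say how the sentence-level loss is computed, so the following is only a minimal sketch of one plausible reading: the decoder's output embeddings and the reference caption's embeddings are each mean-pooled into a single sentence vector, and the loss is one minus their cosine similarity, added to the usual token-level loss. The pooling choice, the cosine form, the `weight` parameter, and all function names here are assumptions made for illustration, not the authors' published formulation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Average a sequence of word vectors into one sentence vector."""
    if not vectors:
        raise ValueError("cannot pool an empty sequence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_level_loss(pred: Sequence[Vector], ref: Sequence[Vector]) -> float:
    """Assumed form: 1 - cosine similarity of the mean-pooled sentence vectors."""
    return 1.0 - cosine_similarity(mean_pool(pred), mean_pool(ref))


def combined_loss(token_nll: float, pred: Sequence[Vector],
                  ref: Sequence[Vector], weight: float = 0.5) -> float:
    """Token-level loss plus a weighted sentence-level term (weight is hypothetical)."""
    return token_nll + weight * sentence_level_loss(pred, ref)
```

Under this sketch, a generated caption whose pooled embedding points in the same direction as the reference incurs almost no sentence-level penalty even if individual tokens differ, which is the sense in which such a term could reward semantic similarity rather than exact word match.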
Related papers
- Audio Caption: Listen And Tell (2019)
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021)
- Automated Audio Captioning: An Overview Of Recent Progress And New Challenges (2022)
- Audio Captioning Using Pre-trained Large-scale Language Model Guided By Audio-based Similar Caption Retrieval (2020)
- Audio Captioning Using Gated Recurrent Units (2020)
- Auto-acd: A Large-scale Dataset For Audio-language Representation Learning (2023)
- Listen Carefully And Tell: An Audio Captioning System Based On Residual Learning And Gammatone Audio Representation (2020)
- Crowdsourcing A Dataset Of Audio Captions (2019)