Clotho: An Audio Captioning Dataset
2019 · Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen
Abstract
Audio captioning is the novel task of describing general audio content in free text. It is an intermodal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of 8 to 20 words length, together with a baseline method that provides initial results. Clotho is built with a focus on audio content and caption diversity, and the data splits do not hamper the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk with annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed in post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
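The figures in the abstract can be cross-checked with a few lines of Python. The sketch below is illustrative only: the constant names and the helper `is_valid_item` are hypothetical and not part of the Clotho release, and counting words by whitespace splitting is an assumption, since the abstract does not say how caption length is measured.

```python
# Figures quoted in the abstract.
NUM_AUDIO_SAMPLES = 4981
NUM_CAPTIONS = 24905
MIN_DURATION_S, MAX_DURATION_S = 15.0, 30.0
MIN_WORDS, MAX_WORDS = 8, 20


def captions_per_sample() -> tuple[int, int]:
    """Return (quotient, remainder) of captions divided by audio samples."""
    return divmod(NUM_CAPTIONS, NUM_AUDIO_SAMPLES)


def is_valid_item(duration_s: float, caption: str) -> bool:
    """Hypothetical check of one (audio, caption) pair against the stated ranges.

    Assumption: words are counted by splitting on whitespace.
    """
    n_words = len(caption.split())
    duration_ok = MIN_DURATION_S <= duration_s <= MAX_DURATION_S
    length_ok = MIN_WORDS <= n_words <= MAX_WORDS
    return duration_ok and length_ok
```

Calling `captions_per_sample()` shows that the caption count divides evenly by the number of audio samples, which is consistent with a fixed number of captions per clip.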