WaveTransformer: A Novel Architecture For Audio Captioning Based On Learning Temporal And Time-frequency Information
2020 · An Tran, Konstantinos Drossos, Tuomas Virtanen
Abstract
Automated audio captioning (AAC) is a novel task in which a method takes an audio sample as input and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work we present a novel AAC method, explicitly focused on exploiting the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding: two for extracting the local and temporal information, and one to merge the output of the previous two processes. To generate the caption, we employ the widely used Transformer decoder. We assess our method using the freely available splits of the Clotho dataset. Our results increase the previously reported highest SPIDEr from 16.2 to 17.3.
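To put the reported result in context, the short Python sketch below works through the arithmetic behind the SPIDEr comparison. It assumes the usual definition of SPIDEr as the arithmetic mean of the CIDEr-D and SPICE scores, which the abstract itself does not spell out. The function name `spider` and the component scores passed to it are hypothetical and chosen only for illustration; only the values 16.2 and 17.3 come from the abstract.

```python
def spider(cider_d: float, spice: float) -> float:
    """SPIDEr, assumed here to be the arithmetic mean of CIDEr-D and SPICE."""
    return 0.5 * (cider_d + spice)


# Hypothetical component scores, NOT taken from the paper.
example_score = spider(cider_d=0.30, spice=0.10)

# SPIDEr values reported in the abstract (on a 0-100 scale).
previous_best = 16.2
reported = 17.3

abs_gain = reported - previous_best   # gain in SPIDEr points
rel_gain = abs_gain / previous_best   # gain relative to the previous best

print(f"example SPIDEr: {example_score:.2f}")
print(f"absolute gain: {abs_gain:.1f} points")
print(f"relative gain: {100 * rel_gain:.1f}%")
```

Under these assumptions, the improvement is 1.1 SPIDEr points, or roughly 6.8% relative to the previous best score.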
Related papers
- Beyond The Status Quo: A Contemporary Survey Of Advances And Challenges In Audio Captioning (2022) 9.03
- Evaluating Off-the-shelf Machine Listening And Natural Language Models For Automated Audio Captioning (2021) 0.00
- Investigating Local And Global Information For Automated Audio Captioning With Transfer Learning (2021) 0.00
- Conette: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding (2023) 11.11
- Killing Two Birds With One Stone: Can An Audio Captioning System Also Be Used For Audio-text Retrieval? (2023) 0.00
- Dual Transformer Decoder Based Features Fusion Network For Automated Audio Captioning (2023) 4.52
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024) 5.24
- Temporal Sub-sampling Of Audio Feature Sequences For Automated Audio Captioning (2020) 0.00