Temporal Sub-sampling Of Audio Feature Sequences For Automated Audio Captioning
2020 · Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen
Abstract
Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the textual description is considerably shorter than the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference in length between the sequences, by applying temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely av
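The idea of sub-sampling between encoder layers can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes that sub-sampling means keeping every `factor`-th time step, and it replaces the RNNs with a hypothetical stand-in (`toy_recurrent_layer`, a running average). The names `subsample`, `toy_recurrent_layer`, `encode`, and the values `num_layers=3` and `factor=2` are illustrative assumptions only.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def subsample(seq: Sequence[Vector], factor: int) -> List[Vector]:
    """Keep every `factor`-th time step (assumed form of sub-sampling)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return list(seq[::factor])


def toy_recurrent_layer(seq: Sequence[Vector], decay: float = 0.5) -> List[Vector]:
    """Hypothetical stand-in for an RNN layer: a running exponential average.

    The output has the same number of time steps as the input.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    state = [0.0] * len(seq[0])
    outputs = []
    for x in seq:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        outputs.append(state)
    return outputs


def encode(seq: Sequence[Vector], num_layers: int = 3, factor: int = 2) -> Tuple[Vector, int]:
    """Stack layers and sub-sample between them (not after the last one).

    Returns the final state as the fixed-length vector, together with the
    number of time steps seen by the last layer.
    """
    h = list(seq)
    for layer in range(num_layers):
        h = toy_recurrent_layer(h)
        if layer < num_layers - 1:
            h = subsample(h, factor)
    return h[-1], len(h)


if __name__ == "__main__":
    # 1000 dummy feature vectors of dimension 4
    frames = [[float(t)] * 4 for t in range(1000)]
    vector, steps = encode(frames, num_layers=3, factor=2)
    print(len(vector), steps)
```

With three layers and a factor of 2, the last layer in this sketch processes a quarter of the original time steps, which narrows the gap between the length of the audio sequence and the length of the caption.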
Related papers
- WaveTransformer: A Novel Architecture For Audio Captioning Based On Learning Temporal And Time-frequency Information (2020)
- Automated Audio Captioning: An Overview Of Recent Progress And New Challenges (2022)
- Automated Audio Captioning With Recurrent Neural Networks (2017)
- Multi-task Regularization Based On Infrequent Classes For Audio Captioning (2020)
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021)
- Improving The Performance Of Automated Audio Captioning Via Integrating The Acoustic And Semantic Information (2021)
- CoNeTTE: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding (2023)
- Audio Word2Vec: Unsupervised Learning Of Audio Segment Representations Using Sequence-to-sequence Autoencoder (2016)