Automatic Audio Captioning Using Attention Weighted Event Based Embeddings
2022 · Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu
Abstract
Automatic Audio Captioning (AAC) refers to the task of translating audio into natural language that describes the audio events, the sources of the events, and their relationships. The limited number of samples in current AAC datasets has set a trend of incorporating transfer learning with Audio Event Detection (AED) as a parent task. In this direction, we propose an encoder-decoder architecture with lightweight (i.e., with fewer learnable parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED-based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing approaches that rely on computationally intensive architectures. Further, we provide evidence that the non-uniform attention-weighted encoding generated as part of our model helps the decoder glance over specific sections of the au
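To make the idea of temporal attention over AED embeddings concrete, here is a minimal sketch of one decoder step attending over frame-level embeddings. It is an illustration under stated assumptions, not the authors' implementation: the dot-product scoring function, the helper names `softmax` and `attend`, and the toy vectors are all hypothetical, since the abstract does not specify how the attention weights are computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frames):
    """Return (weights, context) for one decoder step.

    query  : decoder state vector (assumed to share the embedding dimension)
    frames : list of frame-level embedding vectors, one per time step
    Scoring is a plain dot product, which is an assumption of this sketch.
    """
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Context is the attention-weighted sum of the frame embeddings.
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three time steps, two-dimensional embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]
weights, context = attend(query, frames)
```

The weights are non-uniform across time steps: frames whose embeddings align with the decoder state receive more mass, which is the sense in which a decoder can "glance over" specific sections of the input.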
Related papers
- Improving The Performance Of Automated Audio Captioning Via Integrating The Acoustic And Semantic Information (2021)
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023)
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024)
- An Encoder-decoder Based Audio Captioning System With Transfer And Reinforcement Learning (2021)
- Beyond The Status Quo: A Contemporary Survey Of Advances And Challenges In Audio Captioning (2022)
- WaveTransformer: A Novel Architecture For Audio Captioning Based On Learning Temporal And Time-frequency Information (2020)
- Automated Audio Captioning Via Fusion Of Low- And High- Dimensional Features (2022)
- Investigating Local And Global Information For Automated Audio Captioning With Transfer Learning (2021)