Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation
2023 · Arvind Krishna Sridhar, Yinyi Guo, Erik Visser, et al.
Abstract
There has been significant research on developing pretrained transformer architectures for multimodal-to-text generation tasks. Despite performance improvements, such models are frequently overparameterized and hence suffer from hallucination and a large memory footprint, making them challenging to deploy on edge devices. In this paper, we address both of these issues for the application of automated audio captioning. First, we propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination. Then, we propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models to reach performance equivalent to that of larger models trained with more data. During the beam decoding step, the smaller model utilizes an audio-text shared latent representation to semantically align the generated text with the corresponding input audio. Faithful guidance
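The decoding idea in the abstract can be illustrated with a minimal sketch: at a beam step, each candidate caption is scored by its language-model log-probability plus a weighted similarity between its text embedding and the audio embedding in a shared latent space. Everything named here is an assumption made for illustration and is not taken from the paper: the additive scoring rule, the weight `alpha`, the function names, and the toy three-dimensional embeddings that stand in for a real audio-text encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_beams(audio_emb, candidates, embed_text, alpha=1.0, beam_width=2):
    """Keep the beam_width candidates with the best guided score.

    candidates: list of (caption_text, log_prob) pairs from the captioning model.
    embed_text: callable mapping text to a vector in the shared latent space.
    alpha:      hypothetical weight on the audio-text similarity term.
    """
    scored = []
    for text, log_prob in candidates:
        similarity = cosine_similarity(audio_emb, embed_text(text))
        # Assumed scoring rule: model log-probability plus weighted similarity.
        scored.append((log_prob + alpha * similarity, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:beam_width]]


# Toy stand-in for a shared-space text encoder (hypothetical values).
TOY_TEXT_EMBEDDINGS = {
    "a dog barks": [0.9, 0.1, 0.0],
    "a dog barks while a car passes": [0.5, 0.5, 0.0],
    "a piano plays": [0.0, 0.1, 0.9],
}


def toy_embed_text(text):
    return TOY_TEXT_EMBEDDINGS[text]


audio_emb = [1.0, 0.0, 0.0]  # pretend the clip contains a barking dog
candidates = [
    ("a piano plays", -0.5),  # most likely under the model, but unfaithful
    ("a dog barks", -0.9),
    ("a dog barks while a car passes", -0.8),
]
kept = rerank_beams(audio_emb, candidates, toy_embed_text, alpha=2.0, beam_width=2)
print(kept)
```

In this toy setup the guidance term pushes the unfaithful "a piano plays" candidate out of the beam even though it has the highest model log-probability. With `alpha=0.0` the function falls back to ordinary likelihood-based pruning.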
Related papers
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023)
- Audio Captioning Using Pre-trained Large-scale Language Model Guided By Audio-based Similar Caption Retrieval (2020)
- Advancing Natural-language Based Audio Retrieval With PaSST And Large Audio-caption Data Sets (2023)
- Taming Data And Transformers For Audio Generation (2024)
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024)
- A Better Use Of Audio-visual Cues: Dense Video Captioning With Bi-modal Transformer (2020)
- Dual Transformer Decoder Based Features Fusion Network For Automated Audio Captioning (2023)
- WaveTransformer: A Novel Architecture For Audio Captioning Based On Learning Temporal And Time-frequency Information (2020)