Diverse Audio Captioning Via Adversarial Training
2021 · Xinhao Mei, Xubo Liu, Jianyuan Sun, et al.
Abstract
Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using distinct words and grammar, we argue that an audio captioning system should be able to generate diverse captions both for a fixed audio clip and across similar audio clips. To address this problem, we propose an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions. Unlike a classical GAN, which processes continuous-valued data, a caption generator must produce sentences composed of discrete tokens, and the discrete sampling process is non-differentiable. To address this issue, poli
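The abstract is cut off at this point, so the authors' exact remedy is not recoverable from the text above. As a generic illustration only, one common way around non-differentiable token sampling is a score-function (REINFORCE-style) estimator, which needs the gradient of the log-probability of the sampled token rather than a gradient through the sampling step itself. The sketch below assumes a three-token vocabulary and a hypothetical `toy_reward` standing in for a discriminator score; neither comes from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(token):
    """Hypothetical reward: stands in for a discriminator score (assumption)."""
    return 1.0 if token == 2 else 0.0


def reinforce_gradient(logits, reward, n_samples=20000, seed=0):
    """Monte Carlo estimate of d E[reward(token)] / d logits.

    Sampling a token is not differentiable, but the gradient of
    log p(token) with respect to logit k is (1[k == token] - p_k),
    so each sample contributes reward * grad(log p(token)).
    """
    rng = random.Random(seed)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        token = rng.choices(range(len(logits)), weights=probs)[0]
        r = reward(token)
        for k in range(len(logits)):
            indicator = 1.0 if k == token else 0.0
            grad[k] += r * (indicator - probs[k])
    return [g / n_samples for g in grad]


def exact_gradient(logits, reward):
    """Closed form for comparison: p_k * (reward(k) - E[reward])."""
    probs = softmax(logits)
    expected = sum(p * reward(t) for t, p in enumerate(probs))
    return [probs[k] * (reward(k) - expected) for k in range(len(logits))]


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]
    print("estimated:", reinforce_gradient(logits, toy_reward))
    print("exact:    ", exact_gradient(logits, toy_reward))
```

In this toy setting the sampled estimate can be checked against the closed-form gradient, which is not possible for a real caption generator; there the reward would come from a learned model and the estimate would typically need variance reduction.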
Related papers
- Classifier-guided Captioning Across Modalities (2025) 0.00
- Diverse And Aligned Audio-to-video Generation Via Text-to-video Model Adaptation (2023) 11.19
- High Fidelity Speech Synthesis With Adversarial Networks (2019) 0.00
- Improving Audio Captioning Models With Fine-grained Audio Features, Text Embedding Supervision, And LLM Mix-up Augmentation (2023) 8.82
- VQCPC-GAN: Variable-length Adversarial Audio Synthesis Using Vector-quantized Contrastive Predictive Coding (2021) 5.84
- Multi-task Adversarial Training Algorithm For Multi-speaker Neural Text-to-speech (2022) 0.00
- Multi-SpectroGAN: High-diversity And High-fidelity Spectrogram Generation With Adversarial Style Combination For Speech Synthesis (2020) 0.00
- Automated Audio Captioning: An Overview Of Recent Progress And New Challenges (2022) 12.10