EnCLAP: Combining Neural Audio Codec And Audio-text Joint Embedding For Automated Audio Captioning
2024 · Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, et al.
Abstract
We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP . An online demo is available at https://huggingface.co/spaces/enclap-team/enclap .
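The abstract names masked codec modeling but does not describe its masking procedure. As a rough illustration only, the sketch below shows a generic masked-token corruption step applied to a sequence of discrete codec token ids, of the kind a neural audio codec such as EnCodec produces. The function name, the mask id, the ignore label, the span length, and the masking probability are all hypothetical choices made for this example and are not taken from the paper.

```python
import random

MASK_ID = -1    # hypothetical placeholder id for a masked position
IGNORE = -100   # hypothetical label for positions that are not scored


def mask_codec_tokens(codes, mask_prob=0.15, span=3, seed=None):
    """Mask random spans in a 1-D list of discrete codec token ids.

    mask_prob is the chance that a span starts at a visited position
    (not the overall masked fraction). Returns (corrupted, labels):
    corrupted holds MASK_ID at masked positions, labels holds the
    original id at masked positions and IGNORE everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(codes)
    labels = [IGNORE] * len(codes)
    i = 0
    while i < len(codes):
        if rng.random() < mask_prob:
            # Mask a span of up to `span` tokens starting at position i.
            for j in range(i, min(i + span, len(codes))):
                labels[j] = codes[j]
                corrupted[j] = MASK_ID
            i += span
        else:
            i += 1
    return corrupted, labels


if __name__ == "__main__":
    toy_codes = [5, 9, 2, 7, 7, 1, 3, 8, 4, 6]  # made-up token ids
    corrupted, labels = mask_codec_tokens(toy_codes, mask_prob=0.3, seed=0)
    print(corrupted)
    print(labels)
```

In a setup like this, a model would receive the corrupted sequence and be trained to predict the original ids at the masked positions only. How EnCLAP combines such an objective with the captioning loss is not stated in the abstract.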
Related papers
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022)
- DRCap: Decoding CLAP Latents With Retrieval-augmented Generation For Zero-shot Audio Captioning (2024)
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024)
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024)
- RECAP: Retrieval-augmented Audio Captioning (2023)
- SLAM-AAC: Enhancing Audio Captioning With Paraphrasing Augmentation And CLAP-Refine Through LLMs (2024)
- CoNeTTE: An Efficient Audio Captioning System Leveraging Multiple Datasets With Task Embedding (2023)
- CLAIR-A: Leveraging Large Language Models To Judge Audio Captions (2024)