DiveSound: LLM-assisted Automatic Taxonomy Construction For Diverse Audio Generation
2024 · Baihan Li, Zeyu Xie, Xuenan Xu, et al.
Abstract
Audio generation has attracted significant attention. Despite remarkable improvements in audio quality, existing models overlook diversity evaluation. This is partly due to the lack of a systematic framework for sound class diversity and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be utilized to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a text-audio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase in diversity when generation is guided by visual information.
Related papers
- AudioSetMix: Enhancing Audio-language Datasets With LLM-assisted Augmentations (2024)
- Diverse And Aligned Audio-to-video Generation Via Text-to-video Model Adaptation (2023)
- Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions (2024)
- Auto-ACD: A Large-scale Dataset For Audio-language Representation Learning (2023)
- Audio-omni: Extending Multi-modal Understanding To Versatile Audio Generation And Editing (2026)
- Augment, Drop & Swap: Improving Diversity In LLM Captions For Efficient Music-text Representation Learning (2024)
- DreamAudio: Customized Text-to-audio Generation With Diffusion Models (2026)
- VoiceDiT: Dual-condition Diffusion Transformer For Environment-aware Speech Synthesis (2024)