SoloAudio: Target Sound Extraction With Language-oriented Audio Diffusion Transformer
2024 · Helin Wang, Jiarui Hai, Yen-Ju Lu, et al.
Abstract
In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by using a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio is trained on synthetic audio generated by state-of-the-art text-to-audio models and generalizes well to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and on real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
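The abstract does not spell out how the skip-connected Transformer is wired. As a rough illustration only, the sketch below assumes U-Net-style long skip connections without any resampling: activations from the first half of the blocks are saved and fused, in reverse order, into the second half. The names `run_skip_connected`, `add_one`, and `average` are hypothetical stand-ins and are not taken from the SoloAudio code; in a real model the blocks would be Transformer layers and the fusion would typically be a concatenation followed by a learned projection.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def run_skip_connected(x: Vector,
                       in_blocks: Sequence[Block],
                       mid_block: Block,
                       out_blocks: Sequence[Block],
                       fuse: Callable[[Vector, Vector], Vector]) -> Vector:
    """Run blocks with long skip connections (assumed wiring, see lead-in)."""
    if len(in_blocks) != len(out_blocks):
        raise ValueError("in_blocks and out_blocks must have the same length")
    skips: List[Vector] = []
    for block in in_blocks:
        x = block(x)
        skips.append(x)           # remember each shallow activation
    x = mid_block(x)
    for block in out_blocks:
        x = fuse(x, skips.pop())  # the most recently saved activation is fused first
        x = block(x)
    return x


# Toy stand-ins (hypothetical) so the wiring can be traced by hand.
def add_one(v: Vector) -> Vector:
    return [a + 1.0 for a in v]


def average(a: Vector, b: Vector) -> Vector:
    return [(p + q) / 2.0 for p, q in zip(a, b)]


latent = [0.0, 2.0]
output = run_skip_connected(latent, [add_one, add_one], add_one,
                            [add_one, add_one], average)
print(output)
```

With these toy blocks the deepest saved activation is fused right after the middle block and the shallowest one just before the last block, which is the pairing that lets late layers reuse early, less-processed latent features.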
Related papers
- Language-queried Target Sound Extraction Without Parallel Training Data (2024) · 5.24
- ControlAudio: Tackling Text-guided, Timing-indicated And Intelligible Audio Generation Via Progressive Diffusion Modeling (2025) · 0.00
- EzAudio: Enhancing Text-to-audio Generation With Efficient Diffusion Transformer (2024) · 7.50
- Score Distillation Sampling For Audio: Source Separation, Synthesis, And Beyond (2025) · 0.00
- AudioLDM 2: Learning Holistic Audio Generation With Self-supervised Pretraining (2023) · 0.00
- AudioToken: Adaptation Of Text-conditioned Diffusion Models For Audio-to-image Generation (2023) · 9.76
- TSELM: Target Speaker Extraction Using Discrete Tokens And Language Models (2024) · 0.00
- VoiceDiT: Dual-condition Diffusion Transformer For Environment-aware Speech Synthesis (2024) · 5.84