AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation
2025 · Hui Wang, Jinghua Zhao, Junyang Cheng, et al.
Abstract
Text-to-audio (TTA) generation is advancing rapidly, but evaluation remains challenging because human listening studies are expensive and existing automatic metrics capture only limited aspects of perceptual quality. We introduce AudioEval, a large-scale TTA evaluation dataset with 4,200 generated audio samples (11.7 hours) from 24 systems and 126,000 ratings collected from both experts and non-experts across five dimensions: enjoyment, usefulness, complexity, quality, and text alignment. Using AudioEval, we benchmark diverse automatic evaluators to compare perspective- and dimension-level differences across model families. We also propose Qwen-DisQA as a strong reference baseline: it jointly processes prompts and generated audio to predict multi-dimensional ratings for both annotator groups, modeling rater disagreement via distributional prediction and achieving strong performance. We will release AudioEval to support future research in TTA evaluation.
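To make "modeling rater disagreement via distributional prediction" concrete, here is a minimal sketch of the general idea, not the Qwen-DisQA implementation. It assumes a 5-point rating scale, a KL-divergence training target, and hypothetical helper names (`rating_histogram`, `expected_score`, `kl_divergence`, `softmax`); the abstract specifies none of these. Instead of regressing one mean score per dimension, a model outputs a distribution over rating values, which is compared with the histogram of ratings that annotators actually gave.

```python
import math
from collections import Counter

# Hypothetical rating scale; the abstract does not state the real one.
SCALE = (1, 2, 3, 4, 5)


def rating_histogram(ratings, scale=SCALE, smoothing=1e-6):
    """Convert raw ratings from several annotators into a probability
    distribution over the scale (lightly smoothed to avoid zeros)."""
    counts = Counter(ratings)
    raw = [counts.get(s, 0) + smoothing for s in scale]
    total = sum(raw)
    return [r / total for r in raw]


def softmax(logits):
    """Map unnormalized model outputs to a distribution over the scale."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_score(probs, scale=SCALE):
    """Collapse a rating distribution back to a single scalar score."""
    return sum(p * s for p, s in zip(probs, scale))


def kl_divergence(target, predicted):
    """KL(target || predicted): one possible loss for distributional
    prediction (an assumption, not the paper's stated objective)."""
    return sum(t * math.log(t / q) for t, q in zip(target, predicted) if t > 0)


# Made-up ratings for one clip on one dimension, per annotator group.
expert = rating_histogram([4, 4, 5])
non_expert = rating_histogram([2, 4, 5])

# Made-up model logits for the same clip and dimension.
predicted = softmax([0.1, 0.2, 0.5, 2.0, 1.5])

loss_expert = kl_divergence(expert, predicted)
loss_non_expert = kl_divergence(non_expert, predicted)
```

In this toy example the two groups have similar mean ratings but different spread. A distributional target keeps that spread, whereas a mean-only regression target would discard it.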
Related papers
- MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation (2026)
- ETTA: Elucidating the Design Space of Text-to-Audio Models (2024)
- Audio-Agent: Leveraging LLMs for Audio Generation, Editing and Composition (2024)
- ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling (2025)
- CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions (2025)
- EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer (2024)
- All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation (2026)
- DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation (2024)