MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation
2026 · Yang-Hao Zhou, Haitian Li, Rexar Lin, et al.
Abstract
Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize audio-visual videos with multi-participant dialogues. However, existing evaluation benchmarks remain largely designed for human-recorded videos or single-speaker settings. As a result, potential errors that occur in generated multi-talker dialogue videos, such as identity drift, unnatural turn transitions, and audio-visual misalignment, cannot be effectively captured and analyzed. To address this issue, we introduce MTAVG-Bench, a benchmark for evaluating audio-visual multi-speaker dialogue generation. MTAVG-Bench is built via a semi-automatic pipeline, where 1.8k videos are generated using multiple popular models with carefully designed prompts, yielding 2.4k manually annotated QA pairs. The benchmark evaluates multi-speaker dialogue generation at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. We benchmark 12 proprietary a
Related papers
- Text-to-audio Generation Synchronized With Videos (2024)
- Savgbench: Benchmarking Spatially Aligned Audio-video Generation (2024)
- Audioeval: Automatic Dual-perspective And Multi-dimensional Evaluation Of Text-to-audio-generation (2025)
- VCB Bench: An Evaluation Benchmark For Audio-grounded Large Language Model Conversational Agents (2025)
- Deepaudio-v1: Towards Multi-modal Multi-stage End-to-end Video To Speech And Audio Generation (2025)
- Mpceval: A Benchmark For Multi-party Conversation Generation (2026)
- Vocalbench: Benchmarking The Vocal Conversational Abilities For Speech Interaction Models (2025)
- Javisdit++: Unified Modeling And Optimization For Joint Audio-video Generation (2026)