Text-to-audio Generation Synchronized With Videos
2024 · Shentong Mo, Jing Shi, Yapeng Tian
Abstract
In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often results in discernible audio-visual mismatches. To bridge this gap, we introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench. This benchmark distinguishes itself with three novel metrics dedicated to evaluating visual alignment and temporal consistency. To complement this, we also present a simple yet effective video-aligned TTA generation model, namely T2AV. Moving beyond traditional methods, T2AV refines the latent diffusion approach by integrating visual-aligned text embeddings as its conditional foundation. It employs a temporal multi-