Are Synthetic Videos Useful? A Benchmark For Retrieval-centric Evaluation Of Synthetic Videos
2025 · Zecheng Zhao, Selena Song, Tong Chen, et al.
Abstract
Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarkin
Related papers
- Fighting Fire With FIRE: Assessing The Validity Of Text-to-video Retrieval Benchmarks (2022) 0.00
- CLIP2TV: Align, Match And Distill For Video-text Retrieval (2021) 0.00
- Adversarial Video Promotion Against Text-to-video Retrieval (2025) 1.40
- TokenBinder: Text-video Retrieval With One-to-many Alignment Paradigm (2024) 4.52
- T2VIndexer: A Generative Video Indexer For Efficient Text-video Retrieval (2024) 8.24
- Distilling Vision-language Models On Millions Of Videos (2024) 7.50
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) 0.95
- Video-ColBERT: Contextualized Late Interaction For Text-to-video Retrieval (2025) 5.24