Awesome Text-to-Video
Text-to-Video is one of the most active areas in Awesome Generative Models, with 630 papers in this collection, evaluated on datasets such as COCO, UCF101, and T2I-CompBench. A strong starting point is "Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection".
Datasets & benchmarks
Key papers
- Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection (2025) | Shufan Li et al. | 9.29
- Pushupbench: Your VLM Is Not Good At Counting Pushups (2026) | Shengzhi Li, Jiarun Chen, Karun Sharma, et al. | 9.22
- Learning Few-Step Diffusion Models by Trajectory Distribution Matching (2025) | Yihong Luo et al. | 8.84
- Avatar V: Scaling Video-Reference Avatar Video Generation (2026) | Benjamin Liang et al. | 7.85
- A Review on Generative AI For Text-To-Image and Image-To-Image Generation and Implications To Scientific Images (2025) | Zineb Sordo, Eric Chagnon, Daniela Ushizima | 7.64
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (2024) | Zhuoyi Yang et al. | 7.30
- On the Challenges and Opportunities in Generative AI (2024) | Laura Manduchi et al. | 6.89
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation (2023) | David Junhao Zhang et al. | 6.55
- Latte: Latent Diffusion Transformer for Video Generation (2024) | Xin Ma et al. | 6.11
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control (2025) | Zekai Gu et al. | 4.76
- SINE: SINgle Image Editing with Text-to-Image Diffusion Models (2022) | Zhixing Zhang et al. | 4.48
- MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation (2025) | Mingcheng Li et al. | 4.47
- CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation (2026) | Sharath Girish et al. | 4.39
- Temporal Backtracking Search for Test-time Generative Video Reasoning (2026) | Sejoon Jun et al. | 4.39
- Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation (2026) | Xiaomeng Yang et al. | 4.39
- Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech (2026) | Alef Iury Siqueira Ferreira et al. | 4.39
- FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision (2026) | Shiyao Wang et al. | 4.39
- VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling (2026) | Xunzhi Xiang et al. | 4.39
- CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation (2026) | Sihan Zhuang et al. | 4.39
- ForceForget: Reinforcement Concept Removal for Enhancing Safety in Text-to-Image Models (2026) | Dong Han et al. | 4.39
- Memento: Reconstruct to Remember for Consistent Long Video Generation (2026) | Xuan Wei et al. | 4.39
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text (2024) | Roberto Henschel et al. | 4.21
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects (2024) | Zhao Wang et al. | 4.10
- Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models (2025) | Fan Yang et al. | 3.97
- StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization (2025) | Gopalji Gaur et al. | 3.97
- TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting (2025) | Liangbin Xie et al. | 3.75
- Tutorial on Diffusion Models for Imaging and Vision (2024) | Stanley H. Chan | 3.69
- LTX-Video: Realtime Video Latent Diffusion (2025) | Yoav HaCohen et al. | 3.59
- Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation (2025) | Dawei Dai et al. | 3.59
- Making Time Editable in Video Diffusion Transformers (2026) | Konstantin Kuklev et al. | 3.51
- Structural Energy Guidance for View-Consistent Text-to-3D Generation (2026) | Qing Zhang et al. | 3.45
- Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation (2026) | Ofir Abramovich et al. | 3.45
- Paris 2.0: A Decentralized Diffusion Model for Video Generation (2026) | Ali Rouzbayani et al. | 3.45
- Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation (2024) | Changgu Chen et al. | 2.92
- ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models (2025) | Ozgur Kara et al. | 2.87
- VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors (2025) | Juil Koo et al. | 2.76
- DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation (2025) | Chen Chen et al. | 2.76
- CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models (2025) | Xinle Cheng et al. | 2.71
- UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation (2025) | Lei Zhao et al. | 2.71
- HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation (2025) | Qijun Gan et al. | 2.71
- StyleBlend: Enhancing Style-Specific Content Creation in Text-to-Image Diffusion Models (2025) | Zichong Chen et al. | 2.71
- DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models (2025) | Daewon Chae et al. | 2.71
- Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens (2025) | Dongwon Kim et al. | 2.65
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models (2025) | Weichen Fan et al. | 2.65
- Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models (2025) | Huijie Liu et al. | 2.65
- Tora: Trajectory-oriented Diffusion Transformer for Video Generation (2024) | Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang | 2.32
- Learning Temporally Consistent Video Depth from Video Diffusion Priors (2024) | Jiahao Shao et al. | 2.26
- RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations (2024) | Chengde Lin et al. | 2.21
- Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation (2026) | Francesco Cazzaro et al. | 2.00
- Adapting MLLMs For Nuanced Video Retrieval (2026) | Piyush Bagad, Andrew Zisserman | 2.00
- Vision Smolmamba: Spike-guided Token Pruning For Energy-efficient Spiking State-space Vision Models (2026) | Dewei Bai, Hongxiang Peng, Yunyun Zeng, et al. | 2.00
- Scaling Video Understanding Via Compact Latent Multi-agent Collaboration (2026) | Kerui Chen, Jinglu Wang, Jianrong Zhang, et al. | 2.00
- Golden RPG: Confidence-adaptive Region-aware Noise For Compositional Text-to-image Generation (2026) | Hao Li | 2.00
- VERTIGO: Visual Preference Optimization For Cinematic Camera Trajectory Generation (2026) | Mengtian Li, Yuwei Lu, Feifei Li, et al. | 2.00
- Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse (2026) | Hao Liu et al. | 1.89
- LegoDiffusion: Micro-Serving Text-to-Image Diffusion Workflows (2026) | Lingyun Yang et al. | 1.89
- When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models (2026) | Zhengyang Sun et al. | 1.89
- InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation (2026) | Zhefan Rao et al. | 1.89
- ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis (2026) | Zhengwentai Sun et al. | 1.89
- Text-Guided Texturing by Synchronized Multi-View Diffusion (2023) | Yuxin Liu et al. | 1.87