Awesome Audio Generation
Audio Generation is one of the most active areas in Awesome Generative Models: 731 papers in this collection, evaluated on datasets such as ImageNet, COCO, and CIFAR-10. A strong starting point is "InterleaveThinker: Reinforcing Agentic Interleaved Generation".
Datasets & benchmarks
Key papers
- InterleaveThinker: Reinforcing Agentic Interleaved Generation (2026). Dian Zheng et al. 14.38
- OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains (2026). Xinyue Cai et al. 10.83
- Avatar V: Scaling Video-Reference Avatar Video Generation (2026). Benjamin Liang et al. 7.85
- Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models (2025). Jinho Jeong et al. 7.82
- A Review on Generative AI For Text-To-Image and Image-To-Image Generation and Implications To Scientific Images (2025). Zineb Sordo, Eric Chagnon, and Daniela Ushizima. 7.64
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation (2023). David Junhao Zhang et al. 6.55
- Latte: Latent Diffusion Transformer for Video Generation (2024). Xin Ma et al. 6.11
- Fast-DDPM: Fast Denoising Diffusion Probabilistic Models for Medical Image-to-Image Generation (2024). Hongxu Jiang et al. 5.40
- Projected Coupled Diffusion for Test-Time Constrained Joint Generation (2025). Hao Luan et al. 5.03
- VHDLSuite: Unified Pipeline for LLM VHDL Generation with Data Synthesis and Evaluation (2026). Yijun Shen et al. 5.01
- Long-horizon Streaming Video Generation Via Hybrid Attention With Decoupled Distillation (2026). Ruibin Li, Tao Yang, Fangzhou Ai, et al. 4.89
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control (2025). Zekai Gu et al. 4.76
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think (2024). Sihyun Yu et al. 4.60
- Zero Shot Molecular Generation via Similarity Kernels (2024). Rokas Elijošius et al. 4.57
- PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach (2025). Nitin Rai et al. 4.47
- MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation (2025). Mingcheng Li et al. 4.47
- Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches (2026). Dezhi Yu et al. 4.39
- Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation (2026). Xiaomeng Yang et al. 4.39
- Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech (2026). Alef Iury Siqueira Ferreira et al. 4.39
- FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision (2026). Shiyao Wang et al. 4.39
- VideoWeave: Unlocking Geometric Consistency in Video Generation via Joint Geometry-Video Modeling (2026). Xunzhi Xiang et al. 4.39
- VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification (2026). Xiaoxian Duan et al. 4.39
- LapidaryEngine: Fully Conversational Crystal Generation (2026). Yusei Ito et al. 4.39
- CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation (2026). Sihan Zhuang et al. 4.39
- PepALD: Macrocyclic Peptide Generation via Autoregressive Latent Diffusion (2026). Junming Zhang et al. 4.39
- Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry (2026). Carlo Di Cicco. 4.39
- A Qualitative Review of GenAI-Based Methods for Data Generation and Augmentation in Industrial Computer Vision Applications (2026). Paul Koch et al. 4.39
- Memento: Reconstruct to Remember for Consistent Long Video Generation (2026). Xuan Wei et al. 4.39
- Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit (2026). Xiaoyu Li et al. 4.39
- Latent Process Generator Matching (2026). Lukas Billera et al. 4.33
- Towards Controllable Image Generation through Representation-Conditioned Diffusion Models (2026). Nithesh Chandher Karthikeyan et al. 4.33
- Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer (2026). Andrii Ahitoliev et al. 4.33
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text (2024). Roberto Henschel et al. 4.21
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects (2024). Zhao Wang et al. 4.10
- One-step Latent-free Image Generation with Pixel Mean Flows (2026). Yiyang Lu et al. 3.98
- Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models (2025). Fan Yang et al. 3.97
- StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization (2025). Gopalji Gaur et al. 3.97
- DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation (2025). Yunhan Yang et al. 3.92
- Denoising Multi-Beta VAE: Representation Learning for Disentanglement and Generation (2025). Anshuk Uppal et al. 3.92
- Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation (2025). Jincheng Zhang et al. 3.81
- Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation (2025). Zahra TehraniNasab et al. 3.70
- Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation (2025). Dawei Dai et al. 3.59
- Structural Energy Guidance for View-Consistent Text-to-3D Generation (2026). Qing Zhang et al. 3.45
- Paris 2.0: A Decentralized Diffusion Model for Video Generation (2026). Ali Rouzbayani et al. 3.45
- Generation of non-stationary stochastic fields using Generative Adversarial Networks (2022). Alhasan Abdellatif et al. 3.19
- Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation (2024). Changgu Chen et al. 2.92
- ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models (2025). Ozgur Kara et al. 2.87
- Deep Generative Model-Based Generation of Synthetic Individual-Specific Brain MRI Segmentations (2025). Ruijie Wang et al. 2.82
- Wavelet-based Variational Autoencoders for High-Resolution Image Generation (2025). Andrew Kiruluta. 2.82
- DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation (2025). Chen Chen et al. 2.76
- Compressed Image Generation with Denoising Diffusion Codebook Models (2025). Guy Ohayon et al. 2.71
- A Mixture-Based Framework for Guiding Diffusion Models (2025). Yazid Janati et al. 2.71
- UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation (2025). Lei Zhao et al. 2.71
- HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation (2025). Qijun Gan et al. 2.71
- Ultrasound Image Generation using Latent Diffusion Models (2025). Benoit Freiche et al. 2.71
- CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation (2025). Nikolai Kalischek et al. 2.65
- Efficient Generative Modeling with Residual Vector Quantization-Based Tokens (2024). Jaehyeon Kim et al. 2.60
- Blenderrag: High-fidelity 3D Object Generation Via Retrieval-augmented Code Synthesis (2026). Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli. 2.60
- ZoomLDM: Latent Diffusion Model for multi-scale image generation (2024). Srikar Yellapragada et al. 2.54
- Multi-Source Music Generation with Latent Diffusion (2024). Zhongweiyang Xu et al. 2.43