Awesome Speech Audio

Awesome Music Generation — curated papers, datasets & benchmarks · Awesome Speech Audio

Awesome Music Generation

Music Generation is one of the most active areas in Awesome Speech Audio, with 1,056 papers in this collection, evaluated on datasets such as MAESTRO, Slakh-2100, and AudioCaps. A strong starting point is "DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion".

Datasets & benchmarks

MAESTRO: 8 papers · 🤗

Slakh-2100: 8 papers

AudioCaps: 7 papers · 🤗

GTZAN: 5 papers · 🤗

MusicCaps: 5 papers · 🤗

MUSDB18: 4 papers · 🤗

MUSDB-18-HQ: 4 papers

MusicBench: 3 papers · 🤗

MoisesDB: 3 papers · 🤗

LibriTTS: 3 papers

CocoChorales Dataset: 3 papers

Mandarin dataset: 3 papers

Key papers

60 papers · trending (default) · numbers = 🔥 heat

DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion (2025)
Ziqian Ning et al.
10.21
Generative Adversarial Phonology: Modeling Unsupervised Phonetic And Phonological Learning With Neural Networks (2020)
Gašper Beguš
10.07
MoonCast: High-Quality Zero-Shot Podcast Generation (2025)
Zeqian Ju et al.
8.52
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation (2025)
Jiaqi Li et al.
8.34
MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation (2026)
Deguo Xia et al.
8.18
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025)
Yuxuan Jiang et al.
7.86
SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement (2025)
Chenyu Yang et al.
7.80
A Syllable-structured, Contextually-based Conditionally Generation Of Chinese Lyrics (2019)
Xu Lu, Jie Wang, Bojin Zhuang, et al.
7.16
TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis (2025)
Yu Zhang et al.
7.13
MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling (2025)
Simon Rouard et al.
7.08
Live Music Models (2025)
Lyria Team: Antoine Caillon et al.
6.18
ReaLJam: Real-Time Human-AI Music Jamming with Reinforcement Learning-Tuned Transformers (2025)
Alexander Scarlatos et al.
6.12
AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder (2025)
Samir Sadok et al.
6.06
Identity-based Patterns In Deep Convolutional Networks: Generative Adversarial Phonology And Reduplication (2020)
Gašper Beguš
5.84
The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion To Singing Style Conversion (2025)
Lester Phillip Violeta et al.
5.57
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation (2025)
Pengchao Feng et al.
5.35
DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis (2025)
Zeeshan Ahmad et al.
5.35
Residual Shuffle-exchange Networks For Fast Processing Of Long Sequences (2020)
Andis Draguns, Emīls Ozoliņš, Agris Šostaks, et al.
5.24
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders (2025)
Nathan Paek et al.
5.21
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages (2025)
Shangda Wu et al.
5.18
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching (2025)
Wenxiang Guo et al.
4.76
OMAR-RQ: Open Music Audio Representation Model Trained with Multi-Feature Masked Token Prediction (2025)
Pablo Alonso-Jiménez, Pedro Ramoneda, R. Oguz Araz, Andrea Poltronieri, Dmitry Bogdanov
4.53
A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models (2026)
Prabal Gupta et al.
4.39
An Efficient vLLM-Based Inference Pipeline for Unified Audio Understanding and Generation (2026)
Haoran Wang et al.
4.39
Calliope: An Online Generative Music System for Symbolic Multi-Track Composition (2025)
Renaud Bougueng Tchemeube et al.
4.36
Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation (2026)
Xin Zhang et al.
4.33
Differentiable Articulatory Copy-Synthesis of Biphonic Singing (2026)
Mateo Cámara et al.
4.33
Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation (2026)
Ziyu Zhang et al.
4.33
Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches (2026)
Dezhi Yu et al.
4.33
Continuous Audio Thinking for Large Audio Language Models (2026)
Gyojin Han et al.
4.33
Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation (2026)
Jan Cegin et al.
4.33
MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data (2026)
Subhankar Ghosh et al.
4.33
Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation (2026)
Ioannis Prokopiou et al.
4.33
Audio-to-Audio via Diffusion Warm Initialization (2026)
Cristóbal Andrade et al.
4.33
Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction (2026)
Mohammad Aref Jafari-Raddani
4.33
Security and Privacy in Retrieval-Augmented Generation: Architectures, Threats, Defenses, and Future Directions for Building Trustworthy Systems (2026)
Balamurugan Palanisamy et al.
4.33
Frequency-Aware Self-Supervised Music Representation Learning (2026)
Yicheng Gu et al.
4.33
Generative AI and Copyright Infringement: A Legal-Technical Analysis of AI Music Generation Systems Under 17 U.S.C. Title 17 (2026)
Zuhaib Hussain Butt
4.33
Serenade: A Singing Style Conversion Framework Based On Audio Infilling (2025)
Lester Phillip Violeta et al.
4.30
Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model (2026)
Shinnosuke Taksuka and Hideo Mukai
4.27
Music Transcription with (Almost) No Supervision (2026)
Saebyeol Shin et al.
4.27
MERIT: Learning Disentangled Music Representations for Audio Similarity (2026)
Abhinaba Roy et al.
4.27
Conditional WaveGAN (2018)
Chae Young Lee, Anoop Toffy, Gue Jun Jung, et al.
4.22
UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation (2025)
Zhengyan Sheng, Zhihao Du, Heng Lu, Shiliang Zhang, Zhen-Hua Ling
4.19
MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition (2025)
Philippe Pasquier et al.
4.19
Deepfake Detection of Singing Voices With Whisper Encodings (2025)
Falguni Sharma et al.
4.19
TADA! Tuning Audio Diffusion Models through Activation Steering (2026)
Łukasz Staniszewski et al.
4.08
AnyAccomp: Generalizable Accompaniment Generation via Quantized Melodic Bottleneck (2025)
Junan Zhang et al.
3.97
A Mamba-based Network for Semi-supervised Singing Melody Extraction Using Confidence Binary Regularization (2025)
Xiaoliang He, Kangjie Dong, Jingkai Cao, Shuai Yu, Wei Li, Yi Yu
3.75
ReverbFX: A Dataset of Room Impulse Responses Derived from Reverb Effect Plugins for Singing Voice Dereverberation (2025)
Julius Richter et al.
3.75
kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization (2025)
Keren Shao et al.
3.70
LZMidi: Compression-Based Symbolic Music Generation (2025)
Connor Ding et al.
3.64
High-Fidelity Music Vocoder using Neural Audio Codecs (2025)
Luca A. Lanzendörfer et al.
3.59
WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models (2025)
Yifu Chen et al.
3.59
A Recurrent Connectionist Model Of Melody Perception: An Exploration Using TRACX2 (2023)
Daniel Defays, Robert French, Barbara Tillmann
3.58
Synthesising Handwritten Music With GANs: A Comprehensive Evaluation Of CycleWGAN, ProGAN, And DCGAN (2024)
Elona Shatri, Kalikidhar Palavala, George Fazekas
3.58
CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls (2024)
Li Chai and Donglin Wang
3.53
Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference (2025)
Shuqi Dai et al.
3.53
TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication (2026)
Yong-Bin Kang et al.
3.45
Mood-Aware Music Recommendation: Integrating User Affective Signals into Ranking Systems (2026)
Terence Zeng et al.
3.45