Awesome Speech Audio


© 2026 Awesome Papers.

Awesome Voice Cloning — curated papers, datasets & benchmarks · Awesome Speech Audio


Awesome Voice Cloning

Voice Cloning is one of the most active areas in Awesome Speech Audio, with 808 papers in this collection and evaluations on datasets such as VCTK, LibriTTS, and English. A strong starting point is "Emotional Voice Conversion: Theory, Databases And ESD".

Datasets & benchmarks

VCTK · 10 papers · 🤗

LibriTTS · 7 papers

English · 6 papers

AISHELL-3 · 4 papers · 🤗

UA-Speech · 4 papers

CodecFake+ · 4 papers

LibriSpeech · 3 papers · 🤗

Common Voice · 2 papers · 🤗

DEEP-VOICE · 2 papers · 🤗

Google Speech Commands · 2 papers

Key papers

60 papers · sorted by trending (default) · numbers = 🔥 heat

Emotional Voice Conversion: Theory, Databases And ESD (2021)
Kun Zhou, Berrak Sisman, Rui Liu, et al.
16.30
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction (2025)
Qian Chen et al.
11.29
Voxtral TTS (2026)
Mistral-AI: Alexander H. Liu et al.
10.65
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder (2025)
Bowen Zhang et al.
9.94
Identity-based Patterns In Deep Convolutional Networks: Generative Adversarial Phonology And Reduplication (2020)
Gašper Beguš
5.84
The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion To Singing Style Conversion (2025)
Lester Phillip Violeta et al.
5.57
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages (2025)
Chin-Jou Li et al.
5.35
Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion (2025)
Lea Fischbach et al.
5.04
DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility (2025)
Yifan Liu et al.
4.82
AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement (2025)
Junan Zhang et al.
4.71
Anonymising Elderly And Pathological Speech: Voice Conversion Using DDSP And Query-by-example (2024)
Suhita Ghosh, Melanie Jouaiti, Arnab Das, et al.
4.52
Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis (2026)
Zuda Yu et al.
4.39
Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation (2026)
Ziyu Zhang et al.
4.33
CrossAccent-TTS: Cross-Lingual Accent-Intensity Controllable Text-to-Speech via Disentangled Speaker and Accent Representations (2026)
Ram Annamdevula et al.
4.33
Joint Residual Reweighting for Classifier Free Guidance in Flow-Matching Zero-Shot TTS (2026)
Runwu Shi et al.
4.33
Generative AI and Copyright Infringement: A Legal-Technical Analysis of AI Music Generation Systems Under 17 U.S.C. Title 17 (2026)
Zuhaib Hussain Butt
4.33
VoiceTTA: Enhancing Zero-Shot Text-to-Speech via Reinforcement Learning-Based Test-Time Adaptation (2026)
Tianxin Xie et al.
4.33
Optimized graph convolutional shunted self-attention neural network for multilingual speech-to-text training using cross-language voice conversion of speech representations (2026)
Selvan Chinnaiyan et al.
4.20
Content-Aware Style Augmentation for Zero-Shot Voice Conversion With Short Target Speech (2026)
Hyeonjin Cha et al.
4.20
Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation (2025)
Zhengyan Sheng, Zhihao Du, Heng Lu, Shiliang Zhang, Zhen-Hua Ling
4.19
Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis (2025)
Marc-André Carbonneau et al.
3.86
Speech-dependent Data Augmentation for Own Voice Reconstruction with Hearable Microphones in Noisy Environments (2024)
Mattes Ohlenbusch et al.
3.80
Investigating self-supervised features for expressive, multilingual voice conversion (2025)
Álvaro Martín-Cortinas et al.
3.75
kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization (2025)
Keren Shao et al.
3.70
Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model (2025)
Jialong Zuo et al.
3.59
CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation (2025)
Ziqi Liang et al.
3.53
Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron (2025)
Kishor Kayyar Lakshminarayana, Frank Zalkow, Christian Dittmar, Nicola Pia, Emanuel A.P. Habets
3.53
Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR (2025)
Karl El Hajal et al.
3.53
Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference (2025)
Shuqi Dai et al.
3.53
Coding Speech through Vocal Tract Kinematics (2024)
Cheol Jun Cho et al.
3.20
HISPASpoof: A New Dataset For Spanish Speech Forensics (2025)
Maria Risques, Kratika Bhagtani, Amit Kumar Singh Yadav, Edward J. Delp
3.04
Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation (2025)
Fang Kang et al.
2.93
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play (2025)
Yemin Shi et al.
2.82
ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech (2025)
Yu Pan et al.
2.82
Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning (2025)
Junchuan Zhao et al.
2.82
DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech (2025)
Deok-Hyeon Cho et al.
2.82
ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis (2025)
Hawau Olamide Toyin et al.
2.82
Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes (2025)
Neta Glazer et al.
2.82
When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds (2025)
Minsu Kang et al.
2.82
Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion (2025)
Sandipan Dhar, Md. Tousin Akhter, Nanda Dulal Jana, Swagatam Das
2.76
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens (2025)
Xinsheng Wang et al.
2.71
Singing Voice Conversion with Accompaniment Using Self-Supervised Representation-Based Melody Features (2025)
Wei Chen et al.
2.65
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System (2025)
Wei Deng et al.
2.65
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation (2025)
Zihan Liu et al.
2.65
Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement (2025)
Qianniu Chen et al.
2.60
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications (2024)
Hao-Han Guo et al.
2.43
Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech (2024)
Mateusz Czyżnikiewicz et al.
2.26
An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis (2026)
Vinh Dang Quang et al.
1.94
DuraMark: Duration-Embedded Watermarking in LLM-based TTS (2026)
Zhenwei Mou et al.
1.94
Dynamic Prosody Prediction in LLM-based TTS for Improving Speaker Similarity (2026)
Zhenwei Mou et al.
1.94
Bridging the SEA Gap: An Initial Benchmark for Neural Audio Codec-Synthesized Speech Deepfakes in South-East Asian Languages (2026)
Orchid Chetia Phukan et al.
1.94
Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control (2026)
Joon-Seung Choi et al.
1.94
An Analysis of the Effectiveness of Synthetic Speech Data for ASR Fine-tuning in Selected Indic Languages (2026)
Sujith Pulikodan et al.
1.94
ZONOS2 Technical Report (2026)
Gabriel Clark et al.
1.94
Exploring Cross-Lingual Voice Conversion Methods for Anonymizing Low-Resource Text-to-Speech (2026)
Shenran Wang et al.
1.94
RoCo: Robust Code for Fast and Effective Proactive Defense against Voice Cloning Attack (2026)
Seungmin Kim et al.
1.94
PerTTS: Personalized and Controllable Zero-Shot Spontaneous Style Text-to-Speech Synthesis (2026)
Weiqin Li et al.
1.94
Augmenting Communication Capabilities with Cutting-edge Voice-conversion Technology that Enables Users to Freely Customize the Impression of Their Voice (2026)
Hirokazu Kameoka
1.94
Real Time Translation and Emotional Intelligent Voice Model (2026)
Nooh C. H.
1.94
Source Speech Reconstruction for Many-to-Many and One-to-One Voice Conversion (2026)
Zbyněk Lička et al.
1.94