
A Multi-Level Expressive Voice Cloning Method Based on Adaptive Grouped Code Modeling

Yan Zhu · Rui Zhou · Gang Chen · 2026

Abstract

Personalized voice cloning increasingly requires not only high speaker fidelity but also fine-grained control over rhythm, pitch, intensity, and expressive prosody. However, many existing systems rely on fixed token grouping or structurally static representations, which can limit fine-scale prosodic control, structural adaptability, and robustness across speakers, languages, and speaking styles. This study presents Adaptive Grouped Voice Cloning (AGVC), a unified framework for expressive and prosody-controllable voice cloning through multi-level contextual modeling and adaptive latent grouping. AGVC combines Multi-Head Self-Attention (MHSA) for long-range content-dependent context modeling, a Bidirectional Long Short-Term Memory (BiLSTM) branch for local temporal continuity, and a coupling-based Normalizing Flow (NF) module for invertible modeling of acoustic distributions. At the core of AGVC, Adaptive Grouped Code Modeling (AGCM) adaptively determines grouping granularity according to local rhythmic and expressive variation in the latent sequence, thereby improving prosodic alignment and style consistency without relying on explicit duration alignment or large-scale phoneme annotation. Experiments on a speaker-disjoint stratified split of Common Voice 13.0 with four main style-language test subsets, together with a supplementary Childlike evaluation, show that AGVC reduces Mel Cepstral Distortion (MCD) by 6.5% relative to VALL-E 2 and Fundamental-frequency Root-Mean-Square Error (F0 RMSE) by 14.0% relative to OpenVoice V2, while maintaining competitive real-time factors under matched settings. Human listening tests further show that AGVC achieves the strongest overall perceptual performance and remains statistically comparable to CosyVoice within the reported confidence intervals, while model-as-a-judge evaluation yields mean win-rates of 0.64, 0.71, and 0.56 against OpenVoice V2, FastSpeech 2, and VALL-E 2, respectively.
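The two objective metrics quoted above can be illustrated with a minimal sketch. It uses the textbook definitions of MCD and F0 RMSE and assumes the reference and synthesized frames are already time-aligned. The abstract does not specify the authors' exact evaluation protocol (alignment method, cepstral order, voicing decision), so those details, along with the function names `mel_cepstral_distortion`, `f0_rmse`, and `relative_reduction`, are assumptions made for illustration only.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients; by common
    convention the 0th (energy) coefficient is assumed to be excluded
    by the caller. This is the standard definition, not necessarily
    the exact protocol used in the paper.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need equal-length, non-empty frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((a - b) ** 2 for a, b in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames that are voiced (F0 > 0) in both tracks.

    Treating F0 == 0 as unvoiced is an assumption of this sketch.
    """
    voiced = [(a, b) for a, b in zip(ref_f0, syn_f0) if a > 0 and b > 0]
    if not voiced:
        raise ValueError("no frames are voiced in both tracks")
    return math.sqrt(sum((a - b) ** 2 for a, b in voiced) / len(voiced))


def relative_reduction(baseline, system):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - system) / baseline
```

Under this definition, a reported 6.5% relative MCD reduction means `relative_reduction(baseline_mcd, agvc_mcd)` is about 0.065, where both inputs are averaged over the same test utterances.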
