Abstract
Voice cloning technology has developed rapidly and can currently produce high-quality humanlike voices from as little as 10 s of speech. It is unclear whether cloned voices are as intelligible as their human originals. We compared the intelligibility of ten human voices with their ten voice clones in background noise. Eighty participants listened to 80 sentences (40 human, 40 cloned), presented at four signal-to-noise ratios (+3, 0, −3, and −6 dB) in an online experiment. Cloned voices were up to 13.4% more intelligible than their human counterparts across all noise levels. Principal component analysis with linear discriminant analysis classified human and cloned voices correctly in 79.4% of cases based on an extensive set of acoustic measurements, confirming systematic acoustic differences between the two voice types. Human listeners identified human voices with 70.4% accuracy. Elastic net regression analyses indicated that intelligibility in cloned voices was driven mainly by pitch and harmonic measures, whereas formant- and vowel-space measures were more influential for human voices. Our findings have implications for applications of voice cloning, including voice restoration, speech synthesis for non-verbal individuals, and accessibility for people with hearing loss.
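As an illustration of what presenting a sentence at a given signal-to-noise ratio involves, the following minimal sketch scales a noise signal so that the speech-to-noise RMS ratio matches a target value in dB. It is not the stimulus-preparation procedure used in the study, which the abstract does not describe. The function names (`rms`, `mix_at_snr`, `measured_snr_db`), the equal-length requirement, and the RMS-based definition of SNR are all assumptions made for the example.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` to the target SNR (in dB) relative to `speech`, then add.

    Hypothetical helper: assumes both signals have the same length and
    sampling rate, and defines SNR as 20 * log10(rms_speech / rms_noise).
    Returns the mixture and the scaled noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise must not be silent")
    # Solve SNR_dB = 20 * log10(rms_speech / rms_noise) for the noise RMS.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB computed from the RMS of the two signals."""
    return 20 * math.log10(rms(speech) / rms(noise))
```

Calling `mix_at_snr(speech, noise, -6)` would return a mixture in which the noise RMS is about twice the speech RMS, corresponding to the most difficult of the four listening conditions.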