AI-Generated Voice, Synthetic Speech, and Voice Cloning: Scoping Review with ☸️SAIMSARA

Abstract

This review aims to map the original research literature on AI-generated voice, identify the most query-relevant recurring finding, and synthesize major research topics, practical implications, limitations, and future directions across technical, human-centered, clinical, educational, security, and societal domains. It draws on 226 original studies with 3,297,311 total participants (topic-deduplicated ΣN). The findings suggest that AI-generated voice has reached a level of realism and social utility sufficient to support meaningful applications across education, healthcare, and accessibility, while simultaneously outpacing unaided human ability to distinguish synthetic from authentic speech, with listener accuracy reported as low as 37.5% in vishing-style clips. The dominant signal is a widening gap between human perceptual limits and the demonstrated, though dataset-specific, capability of automated detectors, which reach above 99% accuracy in constrained settings. This pattern indicates that safe deployment depends less on any single performance metric than on layered safeguards combining provenance, explainable detection, and authentication. Generalizability remains constrained by heterogeneous benchmarks and small human studies. Future research should prioritize standardized multilingual, adversarial, and real-time evaluation alongside enforceable consent and provenance frameworks for voice cloning.

Related papers