Abstract
Voice cloning enables realistic fake speech that preserves a speaker’s identity while altering the meaning of what is said. This paper asks whether such meaning-level manipulation leaves detectable traces in transcripts alone. To study this problem, we introduce FakeSpeech+, a paired real–fake dataset in which each authentic speech clip is matched with a semantically altered counterpart re-embedded into a cloned voice that preserves speaker identity. Using this dataset, we conduct a transcript-first analysis based on interpretable text-only features from two groups: (i) linguistic content organization and discourse dynamics, and (ii) compact production-related proxy cues, including hesitation and disfluency markers. We control for transcript length by residualizing these cues and then compare authentic and manipulated transcripts using statistical and experimental analyses. The results show that only a limited subset of features retains strong separation after length control: coordination-related structure and emotion anchoring emerge as the clearest cues, while several production-related and discourse-variability features show weaker but still informative differences. In contrast, a number of syntactic, lexical-diversity, and other discourse-level features show substantial overlap after residualization. These findings indicate that transcript-level structure and selected production-related cues remain informative under realistic content-manipulation threats, supporting the value of transcript-based analysis for identity-preserving fake speech.
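
As an illustration of length control through residualization, the sketch below regresses a single transcript feature on transcript length and keeps the residuals, so that any remaining real-versus-fake difference cannot be attributed to length alone. The abstract does not specify the regression form, so the use of ordinary least squares on log token count is an assumption, and the function name `residualize` and the toy values are hypothetical rather than taken from FakeSpeech+.

```python
import numpy as np

def residualize(feature, length):
    """Return the part of `feature` not linearly explained by log length.

    Assumption: a simple OLS fit on log token count; the paper may use
    a different regression form or additional covariates.
    """
    feature = np.asarray(feature, dtype=float)
    x = np.log(np.asarray(length, dtype=float))
    # Design matrix with an intercept column and a log-length column.
    X = np.column_stack([np.ones_like(x), x])
    # Least-squares fit of the feature on transcript length.
    coef, *_ = np.linalg.lstsq(X, feature, rcond=None)
    # Residuals: the observed feature minus its length-predicted value.
    return feature - X @ coef

# Made-up values for illustration only (not from the dataset).
lengths = np.array([20, 35, 50, 80, 120, 200])
feature = np.array([1.0, 1.4, 1.9, 2.1, 2.8, 3.0])
resid = residualize(feature, lengths)
```

By construction, the residuals have zero mean and are uncorrelated with log length, so a group comparison on `resid` reflects variation beyond what transcript length predicts under this linear model.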