Analysis of Acoustic Anomalies and Speech Artefacts in Synthetic Content

Abstract

Purpose: The study aimed to confirm empirically that acoustic anomalies and speech artefacts can serve as interpretable and robust descriptors for detecting audio deepfakes. It focused on identifying typical deviations in voice, prosody and acoustic spectrum parameters that result from synthetic speech generation by TTS and voice conversion models, in particular in the layers that model the sound source and the vocal tract. The next stage was to assess the stability of these features under realistic (in-the-wild) content distribution conditions, including recompression, variable bandwidth and typical noise.

Design and methods: A complete, classifier-independent framework was developed for harmonising, extracting and selecting acoustic features. The analysis covered the impact of the signal-to-noise ratio (SNR), which determines the quality of an audio recording: a low SNR indicates strong background noise and significantly reduces the effectiveness of phase, cepstral and modulation features. A total of 46,371 clips from the DeepFake RealWorld (DFRW) dataset were analysed; the dataset includes authentic recordings and synthetic recordings generated with various technologies (GAN, diffusion models, TTS, voice conversion). Five descriptor families were defined: tonal-glottal, cepstral-spectral, phase, energy-dynamic and prosodic-modulational. Selection was performed without neural networks, using the differential indicators Δp = p_df − p_real and PR = p_df / p_real with thresholds Δp ≥ 0.15 or PR ≥ 1.5 and false discovery rate (FDR) control (q < 5%).

Results: The analysis revealed significant differences between authentic and synthetic speech. The highest discriminative power was obtained for the LFCC, CQCC and MFCC features (Δp up to 0.25; PR ≈ 1.6–1.8), which remained stable after the degradations typical of social media. The jitter/shimmer, HNR/CPP and modulation features revealed smoothed prosody and excessive voice regularity (Δp ≈ 0.17–0.23). Phase features were useful for detecting harmonic discontinuities, but their effectiveness dropped at low SNR. Combining acoustic analysis with audio-video synchronisation metrics (LSE-C/LSE-D) increased resistance to attacks that disturb a single modality.

Conclusions: The identified speech anomalies and artefacts are a credible and interpretable foundation for audio deepfake detection. The results have direct practical value for public safety and civil protection, because they make it possible to build an auditable audio content analysis layer targeting voice impersonation and message manipulation. In operational scenarios, such as crisis communication by public institutions, verification of the authenticity of recordings disseminated on social media and analysis of social-engineering incidents, interpretable descriptors may shorten triage time, support early warning and reduce the risk that voice misinformation escalates. They can serve as a basis for hybrid forensic systems that combine classic acoustic descriptors with deep learning models to ensure interpretability and resistance to technological drift. The DFRW dataset and the applied selection method enable comparable and repeatable evaluation of feature effectiveness under various distribution conditions. The continuation of the project (DFRWv2) will extend the database to ≥ 500,000 clips and add multimodal audio-video analyses, which will enable standardised reporting of the indicators Δp, PR, p_real, p_df and 95% CI in forensic studies and security engineering.
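
The selection rule stated in the abstract (Δp ≥ 0.15 or PR ≥ 1.5, with FDR control at q < 5%) can be illustrated with a minimal sketch. Several points here are assumptions, not details taken from the study: p_df and p_real are read as the fractions of synthetic and authentic clips in which a descriptor flags an anomaly; the p-values come from a pooled two-proportion z-test; FDR is controlled with the Benjamini-Hochberg procedure; and the descriptor names and counts are invented toy values. The function names are hypothetical.

```python
import math


def two_proportion_p_value(k_df, n_df, k_real, n_real):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test)."""
    pooled = (k_df + k_real) / (n_df + n_real)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_df + 1 / n_real))
    if se == 0:
        return 1.0
    z = (k_df / n_df - k_real / n_real) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            max_rank = rank  # largest rank that satisfies the BH condition
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


def select_descriptors(counts, n_df, n_real, dp_min=0.15, pr_min=1.5, q=0.05):
    """Keep descriptors that pass FDR control and (delta_p >= dp_min or PR >= pr_min).

    counts maps a descriptor name to (k_df, k_real): the number of synthetic
    and authentic clips in which the descriptor flags an anomaly (assumed reading).
    """
    names = list(counts)
    stats = {}
    p_values = []
    for name in names:
        k_df, k_real = counts[name]
        p_df, p_real = k_df / n_df, k_real / n_real
        delta_p = p_df - p_real
        pr = p_df / p_real if p_real > 0 else math.inf
        stats[name] = (delta_p, pr)
        p_values.append(two_proportion_p_value(k_df, n_df, k_real, n_real))
    significant = benjamini_hochberg(p_values, q)
    return {
        name: stats[name]
        for name, sig in zip(names, significant)
        if sig and (stats[name][0] >= dp_min or stats[name][1] >= pr_min)
    }


# Toy counts (hypothetical, not DFRW data): 1000 synthetic and 1000 authentic clips.
toy_counts = {
    "lfcc_anomaly": (600, 350),  # large absolute difference
    "weak_cue": (510, 500),      # negligible difference
    "rare_cue": (60, 30),        # small absolute difference, high ratio
}
selected = select_descriptors(toy_counts, n_df=1000, n_real=1000)
```

In this toy setting the "or" in the rule matters: a rare cue can pass on the ratio PR alone even though its Δp is far below 0.15, while a frequent cue with nearly equal rates in both classes is rejected by all three checks.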
