Pitch Imperfect: Detecting Audio Deepfakes Through Acoustic Prosodic Analysis
2025 · Kevin Warren, Daniel Olszewski, Seth Layton, et al.
Abstract
Audio deepfakes are increasingly indistinguishable from organic speech, often fooling both authentication systems and human listeners. While many techniques rely on low-level audio features or on black-box models trained purely by optimization, focusing on the features that humans use to recognize speech will likely be a more robust long-term approach to detection. We explore the use of prosody, or the high-level linguistic features of human speech (e.g., pitch, intonation, jitter), as a more foundational means of detecting audio deepfakes. We develop a detector based on six classical prosodic features and demonstrate that our model performs as well as other baseline models used by the community to detect audio deepfakes, with an accuracy of 93% and an EER of 24.7%. More importantly, we demonstrate the benefits of using a linguistic-feature-based approach over existing models by applying an adaptive adversary using an \(L_{\infty}\) norm attack against the detectors and using attention mechanisms in o
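The abstract names jitter as one example of a prosodic feature without defining it. As a minimal sketch, assuming the common "local jitter" definition (the mean absolute difference between consecutive pitch periods, divided by the mean period), the measure can be computed as below. The function name and the sample period values are hypothetical illustrations, not the paper's feature-extraction pipeline.

```python
from statistics import mean


def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive
    pitch periods, divided by the mean period (a unitless ratio)."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    # Cycle-to-cycle variation in period length.
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return mean(diffs) / mean(periods)


# Hypothetical glottal-cycle durations in seconds (illustrative values only).
steady = [0.010, 0.010, 0.010, 0.010]   # perfectly regular pitch periods
wobbly = [0.010, 0.011, 0.009, 0.010]   # cycle-to-cycle variation

print(local_jitter(steady))
print(local_jitter(wobbly))
```

A perfectly regular period sequence yields zero jitter, while the varying sequence yields a positive ratio. How such a value relates to real versus synthetic speech is an empirical question for the detector, not something this sketch establishes.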
Related papers
- Combining Automatic Speaker Verification And Prosody Analysis For Synthetic Speech Detection (2022) 10.48
- Adversarial Attacks On Audio Deepfake Detection: A Benchmark And Comparative Study (2025) 0.00
- MFAAN: Unveiling Audio Deepfakes With A Multi-feature Authenticity Network (2023) 7.81
- Securing Voice Biometrics: One-shot Learning Approach For Audio Deepfake Detection (2023) 9.03
- Self-attention And Hybrid Features For Replay And Deep-fake Audio Detection (2024) 0.00
- Anomaly Detection And Localization For Speech Deepfakes Via Feature Pyramid Matching (2025) 4.52
- Detection Of Cross-dataset Fake Audio Based On Prosodic And Pronunciation Features (2023) 0.00
- Zero-day Audio Deepfake Detection Via Retrieval Augmentation And Profile Matching (2025) 0.00