Single And Multi-speaker Cloned Voice Detection: From Perceptual To Learned Features
2023 · Sarah Barrington, Romit Barua, Gautham Koorma, et al.
Abstract
Synthetic-voice cloning technologies have seen significant advances in recent years, giving rise to a range of potential harms. From small- and large-scale financial fraud to disinformation campaigns, these harms make reliable methods for differentiating real from synthesized voices imperative. We describe three techniques for differentiating a real voice from a cloned voice designed to impersonate a specific person. The three approaches differ in their feature extraction stage, ranging from low-dimensional perceptual features that offer high interpretability but lower accuracy, to generic spectral features, to end-to-end learned features that offer less interpretability but higher accuracy. We show the efficacy of these approaches when trained on a single speaker's voice and when trained on multiple voices. The learned features consistently yield an equal error rate between 0% and 4%, and are reasonably robust to adversarial laundering.
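The equal error rate (EER) quoted above is the operating point at which the false-acceptance rate and the false-rejection rate coincide. The following is a minimal sketch of how an EER might be computed from detector scores. It assumes that higher scores mean "more likely cloned" and that labels are 1 for cloned and 0 for real. The function name `equal_error_rate`, the score convention, and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    scores: detector outputs, higher = more likely cloned (assumed convention).
    labels: 1 for cloned clips, 0 for real clips.
    Returns (eer, threshold) at the point where the two error rates are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    cloned = scores[labels == 1]
    real = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "flag nothing" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # False-acceptance rate: real clips wrongly flagged as cloned (score >= t).
    far = np.array([(real >= t).mean() for t in thresholds])
    # False-rejection rate: cloned clips that are missed (score < t).
    frr = np.array([(cloned < t).mean() for t in thresholds])

    # Pick the threshold where the two rates are closest and average them.
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0, thresholds[i]


# Toy data (hypothetical): two real clips and two cloned clips.
separable_eer, _ = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
overlapping_eer, _ = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
print(separable_eer, overlapping_eer)
```

With a finite set of scores the two error curves rarely cross exactly, so this sketch reports the average of the two rates at the closest threshold. Interpolating between neighboring thresholds is a common refinement.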
Related papers
- Securing Voice-driven Interfaces Against Fake (cloned) Audio Attacks (2019)
- Neural Voice Cloning With A Few Samples (2018)
- Defense Against Synthetic Speech: Real-time Detection Of RVC Voice Conversion Attacks (2025)
- Data Efficient Voice Cloning For Neural Singing Synthesis (2019)
- One-class Learning Towards Synthetic Voice Spoofing Detection (2020)
- Toward Improving Synthetic Audio Spoofing Detection Robustness Via Meta-learning And Disentangled Training With Adversarial Examples (2024)
- Voice Cloning: A Multi-speaker Text-to-speech Synthesis Approach Based On Transfer Learning (2021)
- Securing Voice Biometrics: One-shot Learning Approach For Audio Deepfake Detection (2023)