Artificial Intelligence Based ECAPA-TDNN Voice Cloning

Abstract

Voice authentication has become central to telephone-based financial services: banks and other organizations use voice biometrics to confirm users' identities. However, the rapid advancement of AI-powered voice cloning presents new security threats to these systems. This study investigates whether voice authentication can be evaded using synthetic speech generated with F5-TTS v1. Original recordings were collected from four speakers (three male and one female), and cloned versions of their voices were generated from those recordings. Both the real and synthetic samples were tested against three widely used speaker-verification systems: ECAPA-TDNN, ESPnet, and Resemblyzer. On the original voice clips, all three models achieved 100% accuracy. With cloned voices, the results changed: Resemblyzer and ECAPA-TDNN assigned high similarity scores to the synthetic samples (0.86 and 0.80), leading to false acceptances. ESPnet performed better, giving the cloned samples lower similarity scores and rejecting them more often. This difference highlights how susceptible some current voice authentication systems are to modern synthetic speech.
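
To make the notions of similarity score and false acceptance concrete, the following is a minimal sketch of threshold-based speaker verification. It assumes each utterance has already been mapped to a fixed-length embedding and that the score is a cosine similarity. The toy two-dimensional vectors, the function names, and the 0.75 threshold are hypothetical illustrations, not values or code from the study, and real systems use high-dimensional embeddings produced by a neural encoder.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrolled, probe, threshold):
    """Accept the probe if its similarity to the enrolled voice meets the threshold."""
    return cosine_similarity(enrolled, probe) >= threshold

def false_acceptance_rate(enrolled, spoof_probes, threshold):
    """Fraction of spoofed (e.g. cloned) probes that the verifier wrongly accepts."""
    accepted = sum(1 for p in spoof_probes if verify(enrolled, p, threshold))
    return accepted / len(spoof_probes)

# Hypothetical toy embeddings: one enrolled speaker and two cloned probes.
enrolled = [1.0, 0.0]
cloned_probes = [[0.8, 0.6], [0.6, 0.8]]  # similarities of about 0.8 and 0.6
THRESHOLD = 0.75                          # assumed operating point, not from the study

far = false_acceptance_rate(enrolled, cloned_probes, THRESHOLD)
```

Under this sketch, a cloned sample is falsely accepted whenever its score is at or above the decision threshold, so the outcome for scores such as those reported above depends on where each system sets that threshold.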
