PITCH: AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

Abstract

The rise of AI voice-cloning technology, particularly audio Real-time Deepfakes (RTDFs), has intensified social engineering attacks by enabling real-time voice impersonation that bypasses conventional enrollment-based authentication. This technology poses an existential threat to phone-based authentication systems, at a time when total identity fraud losses have reached $43 billion. Unlike traditional robocalls, these personalized AI-generated voice attacks target high-value accounts and circumvent existing defensive measures, creating an urgent cybersecurity challenge. To address this, we propose PITCH, a robust challenge-response method to detect and tag interactive deepfake audio calls. We developed a comprehensive taxonomy of audio challenges based on the human auditory system, linguistics, and environmental factors, yielding 20 prospective challenges. In tests against leading voice-cloning systems on a novel dataset (18,600 original and 1.6 million deepfake samples from 100 users), PITCH's challenges raised machine detection performance to an AUROC of 88.7% and allowed us to identify 10 highly effective challenges. For human evaluation, we selected a challenging, balanced subset on which human evaluators independently achieved 72.6% accuracy, while machines scored 87.7%. Recognizing that call environments require human control, we developed a novel human-AI collaborative system that tags suspicious calls as "Deepfake-likely." Contrary to prior findings, we discovered that integrating human intuition with machine precision offers complementary advantages, giving users maximum control while boosting detection accuracy to 84.5%. This improvement positions PITCH as a potential AI-assisted pre-screener for verifying calls, offering an adaptable approach to combating real-time voice-cloning attacks while maintaining human decision authority.
