
DeepRawNet: empowering deepfake audio detection through dynamic enhancements

Abstract

The generation of deepfake audio poses significant challenges to the reliability and security of systems based on automatic speaker verification (ASV). ASV systems, which have applications in fintech, surveillance, home automation, security, etc., are susceptible to a variety of deepfake and voice cloning attacks, including speech synthesis and voice conversion (VC). Impostors launch audio deepfake attacks on ASV systems to compromise their security and cause financial losses, data breaches, etc. To combat such threats, we propose DeepRawNet, a robust and generalized audio deepfake detection framework that processes raw audio waveforms. Specifically, DeepRawNet is an enhanced version of RawNet2 that introduces three key innovations: (1) we employ the Parametric Rectified Linear Unit (PReLU) activation in the residual blocks in place of the Leaky ReLU used in RawNet2, introducing a learnable negative slope to enhance adaptive feature extraction; (2) we substitute the simple convolution layer in the residual block with a transpose convolution layer, which addresses downsampling issues while preserving fine-grained temporal information crucial for capturing complex patterns in raw audio; and (3) we incorporate the LogSoftmax activation function to stabilize and optimize learning during training and inference. These architectural refinements give DeepRawNet improved adaptability, robust learning capabilities, and an enhanced capacity to capture complex temporal dependencies and discriminative patterns in the audio, making it a more effective solution for audio deepfake detection. We performed a rigorous evaluation of the proposed method on the ASVspoof2019-LA and ASVspoof2021-LA/DF datasets, including algorithm-wise and cross-corpora evaluation, an ablation study with different model configurations, and a comparison against baseline models and existing approaches. Experimental results show that DeepRawNet outperforms the ASVspoof baselines, generalizes better across diverse spoofing attacks, particularly the most challenging VC attacks, and is effective in combating deepfake audio threats.
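
The three building blocks named above can be illustrated with a minimal, dependency-free Python sketch. This is not the DeepRawNet implementation: the function names, the slope values, and the kernel, stride, and padding settings are assumptions chosen for illustration, and the abstract does not state the model's actual hyperparameters. The sketch shows how PReLU differs from Leaky ReLU only in that its negative slope receives a gradient, how a transpose convolution lengthens a sequence where a plain convolution shortens it, and how log-softmax can be computed stably.

```python
import math

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: the negative slope is a fixed constant."""
    return x if x >= 0 else slope * x

def prelu(x, a):
    """PReLU: same shape as Leaky ReLU, but `a` is a trainable parameter."""
    return x if x >= 0 else a * x

def prelu_grad_wrt_a(x):
    """d prelu / d a: non-zero only for negative inputs, so `a` can be learned."""
    return 0.0 if x >= 0 else x

def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a plain 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1

def conv_transpose1d_out_len(length, kernel, stride=1, padding=0, output_padding=0):
    """Output length of a 1D transpose convolution (dilation 1)."""
    return (length - 1) * stride - 2 * padding + kernel + output_padding

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum_exp for v in logits]

if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper).
    print(leaky_relu(-2.0), prelu(-2.0, a=0.25), prelu_grad_wrt_a(-2.0))
    print(conv1d_out_len(100, kernel=3), conv_transpose1d_out_len(100, kernel=3))
    # Hypothetical two-class logits (e.g. bona fide vs. spoof).
    print(log_softmax([2.0, -1.0]))
```

Log-softmax outputs are log-probabilities, so they pair naturally with a negative log-likelihood loss; subtracting the maximum logit before exponentiating avoids overflow for large inputs.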
