ViT-AUDIT: Spectrogram-based Deepfake Audio Detection using Vision Transformers

Abstract

Rapid advances in artificial intelligence have enabled the generation of highly realistic synthetic speech, raising serious concerns for voice-based authentication and communication systems. Detecting deepfake audio is challenging because the spectrotemporal inconsistencies that distinguish it from genuine speech are subtle and difficult to capture with conventional techniques. This paper proposes ViT-AUDIT, a spectrogram-based deep learning framework that uses Mel Frequency Cepstral Coefficients (MFCCs) as input features and compares multiple neural architectures: CNN, RNN, CRNN, LSTM-GRU, and Vision Transformer (ViT). The audio is preprocessed through normalization, resampling, silence removal, and mono conversion, followed by feature extraction and standardization. In experiments on a labeled audio deepfake dataset, the Vision Transformer achieves the best performance among the compared models, with 95.10% accuracy, 95.45% precision, 94.70% recall, and a 95.07% F1-score. These results indicate that transformer-based models are effective at capturing complex audio patterns and suggest that the proposed system is a promising approach for real-world deepfake audio detection.
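
The preprocessing steps named above (mono conversion, resampling, normalization, silence removal, and standardization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 16 kHz target rate, the 400-sample frame length, and the 0.01 RMS threshold are all assumptions chosen for illustration, and the linear-interpolation resampler is a crude stand-in for a proper band-limited resampler. MFCC extraction itself is not shown; in practice it would typically be delegated to an audio library.

```python
import numpy as np

def to_mono(audio):
    """Average channels; expects shape (samples,) or (samples, channels)."""
    audio = np.asarray(audio, dtype=np.float64)
    return audio if audio.ndim == 1 else audio.mean(axis=1)

def resample_linear(audio, orig_sr, target_sr):
    """Crude linear-interpolation resampling (no anti-aliasing filter)."""
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    old_t = np.arange(len(audio)) / orig_sr
    new_t = np.arange(n_out) / target_sr
    return np.interp(new_t, old_t, audio)

def peak_normalize(audio, eps=1e-9):
    """Scale so the largest absolute sample is just under 1."""
    return audio / (np.max(np.abs(audio)) + eps)

def remove_silence(audio, frame_len=400, threshold=0.01):
    """Drop non-overlapping frames whose RMS energy is below the threshold.

    frame_len and threshold are assumed values, not taken from the paper.
    """
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return frames[rms >= threshold].reshape(-1)

def standardize(features, eps=1e-9):
    """Zero mean and unit variance per feature column (e.g. per MFCC)."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

def preprocess(audio, orig_sr, target_sr=16000):
    """Mono conversion, resampling, peak normalization, silence removal."""
    audio = to_mono(audio)
    audio = resample_linear(audio, orig_sr, target_sr)
    audio = peak_normalize(audio)
    return remove_silence(audio)
```

Under these assumptions, `preprocess` would be applied to each raw waveform before feature extraction, and `standardize` to the resulting feature matrix (frames by coefficients) before it is passed to a model.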
