Deepfake Image Detection System

Abstract

Rapid advances in deep learning and generative adversarial networks (GANs) have greatly increased the realism of deepfake technologies, raising serious concerns about the integrity of digital media. While visual manipulation has received substantial attention, recent progress in voice cloning and audio synthesis poses equally significant threats by undermining biometric security and enabling identity fraud. This paper presents a comprehensive study of unimodal deepfake detection systems, evaluating two independent pipelines: a convolutional neural network (CNN)-based image detector, and audio detectors that use either Mel-Frequency Cepstral Coefficients (MFCCs) with Support Vector Machines (SVMs) or CNNs applied to spectrogram representations. By excluding multimodal fusion and recurrent architectures, the study isolates unimodal performance and enables a focused comparison across the visual and auditory domains. Experimental results show that CNN-based image models achieve over 90% accuracy on the Celeb-DF v2 dataset with Grad-CAM-based interpretability, while MFCC+SVM and spectrogram-based CNNs reach approximately 97% and 98% accuracy on the Fake-or-Real and DEEP-VOICE datasets, respectively. Although unimodal detectors perform well in controlled environments, challenges remain in cross-dataset generalization, noise robustness, and adversarial resilience. This work establishes a lightweight baseline for future domain-specific deepfake forensic systems.
