Abstract
Recent advances in speech synthesis and voice conversion models have made audio deepfakes markedly more realistic, posing serious risks of identity fraud, fake news, and breaches of computer security. To address this problem, this paper proposes a deep learning framework for audio deepfake detection that combines Mel-spectrogram representations with a lightweight convolutional neural network trained via transfer learning. The proposed system converts audio signals into time-frequency representations and uses a fine-tuned MobileNetV2 model to classify samples as fake or genuine. Audio preprocessing, model inference, and result visualization are integrated into a single monolithic application built with Streamlit. Experiments on the York University audio deepfake dataset yield an accuracy of 91.92%, indicating that computationally efficient CNN models can be used to detect audio deepfakes in real time.
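As a brief illustration of the Mel-spectrogram front end mentioned above, the sketch below shows how frequencies in hertz map onto the Mel scale and how band centers are spaced along it. It uses the common HTK-style formula; the function names and the values of `n_mels`, `fmin`, and `fmax` are illustrative assumptions, not settings reported in this paper.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, fmin: float, fmax: float) -> List[float]:
    """Return the center frequencies (Hz) of n_mels bands that are
    evenly spaced on the Mel scale between fmin and fmax."""
    mel_min = hz_to_mel(fmin)
    mel_max = hz_to_mel(fmax)
    # n_mels triangular filters need n_mels + 2 edge points, so the
    # centers are the interior points of that evenly spaced grid.
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_mels)]


# Assumed example values (hypothetical, not taken from the paper):
# 128 Mel bands covering 0 to 8000 Hz.
centers = mel_band_centers(n_mels=128, fmin=0.0, fmax=8000.0)
```

Because the bands are evenly spaced in mels rather than hertz, they are narrow at low frequencies and wide at high frequencies. The resulting time-frequency image is the kind of input that an image-oriented CNN such as MobileNetV2 can consume.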