Detecting Audio Deepfakes in the Libyan Dialect: A Stacked Ensemble Approach Using MFCCs and Mel-Spectrogram CNNs

Abstract

Audio Deepfakes (AD) are highly realistic fake audio recordings created with AI-based tools that clone human voices. Recent advances in Text-To-Speech (TTS) and Voice Conversion (VC) technologies have made it easier to generate both synthetic and imitative speech. Although these technologies were designed to improve people’s lives, attackers have misused them, posing significant risks to public safety. Effective algorithms for distinguishing fake audio from real audio are therefore critical. Various machine learning (ML) and deep learning (DL) techniques have been developed to identify deepfake audio, but their reliance on massive training data or excessive pre-processing introduces significant methodological challenges. This paper proposes a hybrid framework for detecting AI-generated speech that integrates a Multi-Layer Perceptron (MLP), a Convolutional Neural Network (CNN), a Dense Neural Network (DNN), and an optimized XGBoost classifier within a stacking ensemble. The framework uses Mel-Frequency Cepstral Coefficients (MFCCs) and their derivatives, along with the average STFT spectrum and the Mel-Spectrogram, as audio features. The real recordings were collected from Libyan-dialect speech, and the deepfake clips were generated using Retrieval-based Voice Conversion (RVC) and a Tacotron model. Our dataset consists of three versions containing audio clips lasting three, five, and seven seconds. Extensive experiments demonstrate that the proposed system achieves its highest performance with five-second segments, reaching an accuracy of 98.75%, while maintaining strong performance for the other durations. These results highlight the benefits of combining DL and traditional ML techniques for robust audio forgery detection.
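
As an illustration of the three dataset versions described above, the following minimal sketch cuts a recording into fixed-length clips of three, five, and seven seconds. The abstract does not state how the clips were produced, so the 16 kHz sampling rate, the non-overlapping windows, the dropping of any trailing remainder, and the helper name `split_into_clips` are all assumptions made for this example; the random signal merely stands in for a real recording.

```python
import numpy as np


def split_into_clips(waveform, sample_rate, clip_seconds):
    """Cut a 1-D waveform into non-overlapping clips of clip_seconds each.

    Assumption: a trailing remainder shorter than one clip is dropped.
    """
    clip_len = int(round(sample_rate * clip_seconds))
    if clip_len <= 0:
        raise ValueError("clip length must be positive")
    n_clips = len(waveform) // clip_len
    trimmed = np.asarray(waveform[: n_clips * clip_len])
    return trimmed.reshape(n_clips, clip_len)


# Hypothetical 20-second recording at an assumed 16 kHz sampling rate.
sample_rate = 16_000
recording = np.random.default_rng(0).standard_normal(20 * sample_rate)

# One set of clips per dataset version (3 s, 5 s, 7 s).
clips_by_duration = {
    seconds: split_into_clips(recording, sample_rate, seconds)
    for seconds in (3, 5, 7)
}

for seconds, clips in clips_by_duration.items():
    print(f"{seconds}-second version: {clips.shape[0]} clips of {clips.shape[1]} samples")
```

Each row of the returned array is one clip, so features such as MFCCs or a Mel-Spectrogram could then be computed per row with whatever audio library is in use.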
