AVTENet: A Human-cognition-inspired Audio-visual Transformer-based Ensemble Network For Video Deepfake Detection
2023 · Ammarah Hashmi, Sahibzada Adil Shahzad, Chia-Wen Lin, et al.
Abstract
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous studies on detecting artificial intelligence-generated fake videos use only the visual modality or only the audio modality. While some methods exploit both modalities to detect forged videos, they have not been comprehensively evaluated on multimodal deepfake datasets involving both acoustic and visual manipulations, and most are based on convolutional neural networks with low detection accuracy. Considering that human cognition instinctively integrates multisensory information, including audio and visual cues, to perceive and interpret content, and given the success of transformers in various fields, this study introduces the audio-visual transformer-based ensemble network (AVTENet). This innovative framework tackles the complexities of deepfake technology by integrating both acoustic and visual manipulations to enhance the accuracy of video forger
Related papers
- ERF-BA-TFD+: A Multimodal Model For Audio-visual Deepfake Detection (2025) · 2.26
- Multi-modal Deepfake Detection And Localization With FPN-transformer (2025) · 2.23
- MFAAN: Unveiling Audio Deepfakes With A Multi-feature Authenticity Network (2023) · 7.81
- Straight Through Gumbel Softmax Estimator Based Bimodal Neural Architecture Search For Audio-visual Deepfake Detection (2024) · 5.84
- AUDETER: A Large-scale Dataset For Deepfake Audio Detection In Open Worlds (2025) · 0.00
- Adversarial Attacks On Audio Deepfake Detection: A Benchmark And Comparative Study (2025) · 0.00
- Investigating Self-supervised Representations For Audio-visual Deepfake Detection (2025) · 0.00
- Vulnerability Of Automatic Identity Recognition To Audio-visual Deepfakes (2023) · 6.77