Multi-modal Deepfake Detection And Localization With FPN-Transformer
2025 · Chende Zheng, Ruiqi Suo, Zhoulin Ji, et al.
Abstract
The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context t
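As a rough illustration of the multi-scale temporal feature pyramid the abstract mentions, the sketch below builds coarser levels from a sequence of per-frame feature vectors. It is not the paper's R-TLM block, whose details are not given in this excerpt: plain average pooling over neighbouring frames stands in for it, and the names `downsample_by_two` and `build_temporal_pyramid`, the toy features, and the number of levels are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]  # one feature vector per time step (e.g. from a pre-trained encoder)


def downsample_by_two(frames: List[Frame]) -> List[Frame]:
    """Average each pair of neighbouring frames; a trailing odd frame is kept as-is."""
    pooled = []
    for i in range(0, len(frames), 2):
        pair = frames[i:i + 2]
        dim = len(pair[0])
        pooled.append([sum(f[d] for f in pair) / len(pair) for d in range(dim)])
    return pooled


def build_temporal_pyramid(frames: List[Frame], num_levels: int) -> List[List[Frame]]:
    """Return [level 0, level 1, ...], each level at half the temporal resolution.

    Assumption: average pooling replaces the learned blocks used in the paper.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    levels = [frames]
    while len(levels) < num_levels and len(levels[-1]) > 1:
        levels.append(downsample_by_two(levels[-1]))
    return levels


# Toy input: 8 time steps, 2-dimensional features (hypothetical values).
features = [[float(t), float(t) * 2.0] for t in range(8)]
pyramid = build_temporal_pyramid(features, num_levels=3)
level_lengths = [len(level) for level in pyramid]
```

In a localization setting, coarser levels cover longer time spans per position, so they are the natural place to look for long forged segments, while the finest level is suited to short ones.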
Related papers
- Anomaly Detection And Localization For Speech Deepfakes Via Feature Pyramid Matching (2025)
- AVTENet: A Human-cognition-inspired Audio-visual Transformer-based Ensemble Network For Video Deepfake Detection (2023)
- ERF-BA-TFD+: A Multimodal Model For Audio-visual Deepfake Detection (2025)
- Straight Through Gumbel Softmax Estimator Based Bimodal Neural Architecture Search For Audio-visual Deepfake Detection (2024)
- MFAAN: Unveiling Audio Deepfakes With A Multi-feature Authenticity Network (2023)
- TranssionADD: A Multi-frame Reinforcement Based Sequence Tagging Model For Audio Deepfake Detection (2023)
- Investigating Self-supervised Representations For Audio-visual Deepfake Detection (2025)
- Heterogeneity Over Homogeneity: Investigating Multilingual Speech Pre-trained Models For Detecting Audio Deepfake (2024)