The Affective Bridge: Preserving Speech Representations While Enhancing Deepfake Detection via Emotional Constraints

2025


Abstract

Speech deepfake detection (DFD) has benefited from diverse acoustic and semantic speech representations, many of which encode valuable speech information and are costly to train. Existing approaches typically enhance DFD by tuning the representations or applying post-hoc classification on frozen features, which limits control over improving discriminative deepfake cues without distorting the original semantics. We find that emotion is encoded across diverse speech features and correlates with DFD. Therefore, we introduce a unified, feature-agnostic, and non-destructive training framework that uses emotion as a bridging constraint to guide speech features toward DFD, treating emotion recognition as a representation alignment objective rather than an auxiliary task, while preserving the original semantic information. Experiments on FakeOrReal and IntheWild show accuracy improvements of up to 6% and 2%, respectively, with corresponding reductions in equal error rate. Code is in the supplementary material.
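The equal error rate cited above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of one common way to estimate it, not the authors' evaluation code. The function name `equal_error_rate` and the score convention (label 1 means bona fide, and a higher score means more likely bona fide) are assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate EER from detection scores (hypothetical helper).

    Assumed convention: label 1 = bona fide, label 0 = spoofed,
    and a higher score means "more likely bona fide".
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bona_fide = scores[labels == 1]
    spoofed = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the
    # maximum so that "reject everything" is also considered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # False acceptance: spoofed samples scored at or above the threshold.
    far = np.array([(spoofed >= t).mean() for t in thresholds])
    # False rejection: bona fide samples scored below the threshold.
    frr = np.array([(bona_fide < t).mean() for t in thresholds])

    # With finitely many samples the two curves may not cross exactly,
    # so take the threshold where they are closest and average them.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```

Averaging the two rates at the closest threshold is a simple discrete approximation; evaluation toolkits often interpolate between thresholds instead, so values may differ slightly on small test sets.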
