Tri-View Collaborative Graph Learning for Robust Deepfake Speech Detection

Abstract

With the rapid advancement of deep learning, text-to-speech and voice conversion systems can now generate synthetic speech, commonly known as deepfake speech, that sounds nearly identical to real human voices. This poses serious threats to public security and creates an urgent need for reliable detection methods. However, most existing approaches rely on either raw waveform or spectrogram features alone, overlooking cross-view relationships that could reveal artifacts from unknown spoofing attacks. To address this gap, we propose a tri-view collaborative graph learning framework that enhances detection robustness. Our model integrates three complementary views: 1D waveform features, 2D spectral features, and linguistic features derived from an automatic speech recognition (ASR) system. To improve both discriminability and cross-view independence, we design a Tri-View Contrastive Learning (TVCL) framework, which employs a cross-view contrastive loss to emphasize cross-view complementarity and an intra-view contrastive loss to strengthen class separation within each view. We further introduce a Dynamic Graph Attention Network (DGAT) that captures temporal dependencies across the multi-view features. Through attention-based aggregation and a learnable weighting mechanism, the DGAT adaptively balances contributions from the different views, suppressing noise and promoting complementary cooperation. Finally, the graph-level embeddings produced by the DGAT are used for classification. Extensive experiments on five benchmark datasets demonstrate the superiority of our approach, with consistent improvements over state-of-the-art methods in cross-method, cross-dataset, and cross-language scenarios.
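
The abstract does not specify how the learnable weighting mechanism is parameterized. As a rough illustration only, the sketch below assumes one scalar logit per view, normalized with a softmax and used to form a weighted sum of the three view embeddings. The names `fuse_views` and `view_logits`, and the choice of softmax, are hypothetical and are not taken from the paper; the actual DGAT may weight views quite differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_embeddings, view_logits):
    """Weighted sum of per-view embeddings (hypothetical fusion rule).

    view_embeddings: list of equal-length float lists, one per view.
    view_logits: one scalar per view; in a trained model these would be
                 learnable parameters (an assumption, not the paper's design).
    Returns (fused_embedding, weights).
    """
    if len(view_embeddings) != len(view_logits):
        raise ValueError("need exactly one logit per view")
    dim = len(view_embeddings[0])
    if any(len(emb) != dim for emb in view_embeddings):
        raise ValueError("all view embeddings must share one dimension")

    weights = softmax(view_logits)
    fused = [0.0] * dim
    for weight, emb in zip(weights, view_embeddings):
        for i, value in enumerate(emb):
            fused[i] += weight * value
    return fused, weights


# Toy embeddings standing in for the waveform, spectral, and linguistic views.
waveform = [1.0, 0.0]
spectral = [0.0, 1.0]
linguistic = [1.0, 1.0]

# Equal logits give every view the same weight.
fused, weights = fuse_views([waveform, spectral, linguistic], [0.0, 0.0, 0.0])
```

With equal logits the fusion reduces to a plain average of the views. Raising one logit shifts weight toward that view, which is one simple way a model could down-weight a noisy view.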
