Abstract
Recent advances in large language models and voice cloning have enabled deepfakes that alter the semantic meaning of speech while keeping the speaker's tone and visual identity consistent. Existing datasets mainly capture surface-level acoustic or prosodic artifacts and overlook semantic manipulations, which are harder to detect. This paper introduces FakeSpeech, an audiovisual deepfake dataset designed to benchmark semantic-level speech manipulation under realistic visual alignment. The dataset contains 970 talking-face clips (485 real and 485 fake) derived from FakeAVCeleb videos. Fake samples were generated by rewriting transcripts with GPT-4.5 and resynthesizing the voices with ElevenLabs, guided by the Phantom Reading technique to preserve lip synchronization. Unlike prior benchmarks, FakeSpeech keeps identity fixed: the video frames are real and unchanged, and the voice remains that of the same speaker; only the spoken content changes. In other words, prior work often studies how a person speaks, whereas we study what the person says in the same familiar way. With an audio-only baseline, detection accuracy was 0.85 on FakeAVCeleb but dropped to 0.67 on FakeSpeech (an absolute decrease of 0.18), indicating that the added content-level manipulation makes the dataset more complex and detection harder. Overall, FakeSpeech bridges the gap between acoustic and multimodal forgeries, offering a realistic benchmark for studying semantic-prosodic alignment and advancing deepfake detection beyond surface artifacts.