Abstract
Diffusion-based speech generation has achieved remarkable fidelity, increasing the risk of misuse and unauthorized redistribution. However, most existing generative speech watermarking methods are developed for GAN-based pipelines, and watermarking for diffusion-based speech generation remains comparatively underexplored. In addition, prior work often focuses on content-level provenance, while support for model-level and user-level attribution is less mature. We propose \textbf{TriniMark}, a diffusion-based generative speech watermarking framework that targets trinity-level traceability, i.e., the ability to associate a generated speech sample with (i) the embedded watermark message (content-level provenance), (ii) the source generative model (model-level attribution), and (iii) the end user who requested generation (user-level traceability). TriniMark uses a lightweight encoder to embed watermark bits into time-domain speech features and reconstruct the waveform, and a temporal-awar