SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis
2024 · Helin Wang, Meng Yu, Jiarui Hai, et al.
Abstract
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech, so that the edited parts can be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state-of-the-art performance on the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels in multi-span speech editing and demonstrates remarkable robustness to background sounds. The source code and demos are released.
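The abstract names classifier-free guidance without giving the rule. As a rough illustration only, the sketch below uses the commonly cited formulation, which blends conditional and unconditional logits before sampling the next codec token. The function names, the toy logits, and the guidance scale of 1.5 are assumptions made for this example; they are not taken from the SSR-Speech paper or its code.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Blend logits as uncond + scale * (cond - uncond).

    This is the commonly used classifier-free guidance rule; it is an
    assumption here, not the exact rule or scale used by SSR-Speech.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a 3-token codebook (hypothetical values).
cond = [2.0, 0.5, -1.0]    # model conditioned on the text prompt
uncond = [1.0, 1.0, 0.0]   # same model with the condition dropped

guided = cfg_logits(cond, uncond, scale=1.5)
probs = softmax(guided)    # distribution the next token is sampled from
```

With this rule, a scale of 1 recovers the conditional logits and a scale of 0 recovers the unconditional ones; values above 1 push the distribution further toward tokens the text condition favors.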
Related papers
- LiveSpeech: Low-latency Zero-shot Text-to-speech via Autoregressive Modeling of Audio Discrete Codes (2024) 5.84
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer (2023) 11.85
- MSR-Codec: A Low-bitrate Multi-stream Residual Codec for High-fidelity Speech Generation with Information Disentanglement (2025) 2.35
- Noise-robust Zero-shot Text-to-speech Synthesis Conditioned on Self-supervised Speech-representation Model with Adapters (2024) 7.50
- EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion (2021) 9.92
- VoiceShop: A Unified Speech-to-speech Framework for Identity-preserving Zero-shot Voice Editing (2024) 0.00
- RosettaSpeech: Zero-shot Speech-to-speech Translation Without Parallel Speech (2025) 0.00
- Super Denoise Net: Speech Super Resolution with Noise Cancellation in Low Sampling Rate Noisy Environments (2023) 0.00