Pronunciation Editing For Finnish Speech Using Phonetic Posteriorgrams
2025 · Zirui Li, Lauri Juvela, Mikko Kurimo
Abstract
Synthesized second-language (L2) speech is potentially highly valuable for L2 language learning and feedback. However, L2 speech is difficult to synthesize for low-resource languages because L2 speech synthesis datasets are lacking. In this paper, we provide a practical solution for editing native speech to approximate L2 speech and present PPG2Speech, a diffusion-based multispeaker Phonetic-Posteriorgrams-to-Speech model that is capable of editing a single phoneme without text alignment. We use Matcha-TTS's flow-matching decoder as the backbone, transforming Phonetic Posteriorgrams (PPGs) to mel-spectrograms conditioned on external speaker embeddings and pitch. PPG2Speech strengthens Matcha-TTS's flow-matching decoder with Classifier-free Guidance (CFG) and Sway Sampling. We also propose a new task-specific objective evaluation metric, Phonetic Aligned Consistency (PAC), computed between the edited PPGs and the PPGs extracted from the synthesized speech, to measure the editing effect.
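The pipeline the abstract describes (edit a PPG, then decode it with a flow-matching sampler that uses CFG and Sway Sampling) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the one-hot phoneme replacement, the particular CFG combination rule, and the toy `velocity_fn` interface are all assumptions made for illustration. The sway formula follows the form commonly given for Sway Sampling, t = u + s(cos(πu/2) − 1 + u).

```python
import numpy as np

def edit_ppg(ppg, start, end, target_phone):
    """Hypothetical edit: overwrite frames [start, end) with a one-hot
    posterior for target_phone. ppg has shape (frames, phonemes)."""
    edited = ppg.copy()
    edited[start:end] = 0.0
    edited[start:end, target_phone] = 1.0
    return edited

def sway_schedule(n_steps, s=-1.0):
    """Warp uniform times u in [0, 1] with t = u + s*(cos(pi/2*u) - 1 + u).
    Negative s places more solver steps near t = 0."""
    u = np.linspace(0.0, 1.0, n_steps + 1)
    return u + s * (np.cos(0.5 * np.pi * u) - 1.0 + u)

def cfg_velocity(v_cond, v_uncond, w):
    """One common CFG form (assumed here): w = 0 is unconditional,
    w = 1 is purely conditional, w > 1 extrapolates past the condition."""
    return v_uncond + w * (v_cond - v_uncond)

def sample(velocity_fn, x0, cond, n_steps=8, w=2.0, s=-1.0):
    """Euler integration of the guided velocity field along the sway
    schedule. velocity_fn(x, t, cond) is a stand-in for the decoder;
    cond=None requests the unconditional prediction."""
    t = sway_schedule(n_steps, s)
    x = x0
    for t0, t1 in zip(t[:-1], t[1:]):
        v = cfg_velocity(velocity_fn(x, t0, cond), velocity_fn(x, t0, None), w)
        x = x + (t1 - t0) * v
    return x
```

In a real system, `velocity_fn` would be the trained decoder conditioned on the edited PPG, speaker embedding, and pitch, and `x0` would be Gaussian noise with the shape of the target mel-spectrogram.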
Related papers
- EdiTTS: Score-based Editing For Controllable Text-to-speech (2021)
- EditSpeech: A Text Based Speech Editing System Using Partial Inference And Bidirectional Fusion (2021)
- SpeechBlender: Speech Augmentation Framework For Mispronunciation Data Generation (2022)
- Phonology-guided Speech-to-speech Translation For African Languages (2024)
- FluentLip: A Phonemes-based Two-stage Approach For Audio-driven Lip Synthesis With Optical Flow Consistency (2025)
- Transformer-S2A: Robust And Efficient Speech-to-animation (2021)
- Improving Mispronunciation Detection With Wav2vec2-based Momentum Pseudo-labeling For Accentedness And Intelligibility Assessment (2022)
- Towards Natural And Controllable Cross-lingual Voice Conversion Based On Neural TTS Model And Phonetic Posteriorgram (2021)