Improving Sequence-to-sequence Acoustic Modeling By Adding Text-supervision
2018 · Jing-Xuan Zhang, Zhen-Hua Ling, Yuan Jiang, et al.
Abstract
This paper presents methods that use text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic modeling method proposed in our previous work achieved higher naturalness and similarity. In this paper, we further improve its performance by utilizing the text transcriptions of the parallel training data. First, a multi-task learning structure is designed that adds auxiliary classifiers to the middle layers of the seq2seq model and predicts linguistic labels as a secondary task. Second, a data-augmentation method is proposed that uses text alignment to produce extra parallel sequences for model training. Experiments are conducted to evaluate the proposed methods with training sets of different sizes. Experimental results show that multi-task learning with linguistic labels is effective at reducing the errors of seq2seq voice conversion. The data-a
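The multi-task structure described in the abstract can be illustrated with a minimal sketch of a combined training objective. It assumes the main task is scored with mean squared error on acoustic frames and the auxiliary classifier with frame-level cross-entropy on linguistic labels; the function names, the `aux_weight` parameter, and its default value are hypothetical and are not taken from the paper.

```python
import math


def mean_squared_error(predicted, target):
    """Average squared difference over all frames and feature dimensions."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(predicted, target):
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]


def multitask_loss(predicted, target, aux_logits, labels, aux_weight=0.1):
    """Main acoustic loss plus a weighted auxiliary classification loss.

    predicted, target: lists of acoustic feature frames (lists of floats).
    aux_logits: per-frame class scores from an auxiliary classifier attached
        to a middle layer (assumed layout, for illustration only).
    labels: per-frame linguistic label indices.
    aux_weight: hypothetical weight of the secondary task.
    """
    main = mean_squared_error(predicted, target)
    aux = sum(cross_entropy(l, y) for l, y in zip(aux_logits, labels)) / len(labels)
    return main + aux_weight * aux
```

In a setup like this the auxiliary classifier only shapes the intermediate representations during training; at conversion time it can be dropped, since only the acoustic output is needed.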
Related papers
- A General Multi-task Learning Framework To Leverage Text Data For Speech To Text Tasks (2020)
- Towards Improving Nam-to-speech Synthesis Intelligibility Using Self-supervised Speech Models (2024)
- Semi-supervised Sequence-to-sequence ASR Using Unpaired Speech And Text (2019)
- Two-stage Training Method For Japanese Electrolaryngeal Speech Enhancement Based On Sequence-to-sequence Voice Conversion (2022)
- Unsupervised Pre-training For Sequence To Sequence Speech Recognition (2019)
- Voice Transformer Network: Sequence-to-sequence Voice Conversion Using Transformer With Text-to-speech Pretraining (2019)
- Almost Unsupervised Text To Speech And Automatic Speech Recognition (2019)
- Any-to-many Voice Conversion With Location-relative Sequence-to-sequence Modeling (2020)