Sequence-to-sequence Singing Synthesis Using The Feed-forward Transformer
2019 · Merlijn Blaauw, Jordi Bonada
Abstract
We propose a sequence-to-sequence singing synthesizer, which avoids the need for training data with pre-aligned phonetic and acoustic features. Rather than the more common approach of a content-based attention mechanism combined with an autoregressive decoder, we use a different mechanism suitable for feed-forward synthesis. Given that phonetic timings in singing are highly constrained by the musical score, we derive an approximate initial alignment with the help of a simple duration model. Then, using a decoder based on a feed-forward variant of the Transformer model, a series of self-attention and convolutional layers refines the result of the initial alignment to reach the target acoustic features. Advantages of this approach include faster inference and avoiding the exposure bias issues that affect autoregressive models trained by teacher forcing. We evaluate the effectiveness of this model compared to an autoregressive baseline, the importance of self-attention, and the importance
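The initial alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the duration model, so everything here is an assumption made for illustration: the heuristic (consonants get a short fixed length, vowels fill the rest of the note), the `VOWELS` set, and the function names `phoneme_durations` and `initial_alignment` are all hypothetical. The point is only to show how note timings from a score can constrain phoneme timings into a rough frame-level sequence that a feed-forward decoder could then refine.

```python
# Hypothetical sketch: NOT the paper's duration model, only a stand-in heuristic.
VOWELS = {"a", "e", "i", "o", "u"}  # assumed toy vowel inventory


def phoneme_durations(phonemes, note_frames, consonant_frames=5):
    """Split one note's frames among its phonemes.

    Consonants get up to `consonant_frames` frames each; vowels share the rest.
    """
    n_vow = sum(p in VOWELS for p in phonemes)
    n_cons = len(phonemes) - n_vow
    if n_vow == 0:
        raise ValueError("this sketch assumes at least one vowel per note")
    if note_frames < len(phonemes):
        raise ValueError("note too short to give every phoneme one frame")

    # Shrink consonants when the note is short, keeping one frame per vowel.
    cons = min(consonant_frames, max(1, (note_frames - n_vow) // max(1, n_cons)))
    remaining = note_frames - cons * n_cons

    durations = []
    vowels_left = n_vow
    for p in phonemes:
        if p in VOWELS:
            d = remaining // vowels_left  # spread leftover frames over vowels
            remaining -= d
            vowels_left -= 1
        else:
            d = cons
        durations.append(d)
    return durations


def initial_alignment(score, consonant_frames=5):
    """Expand a score, given as (phonemes, note_frames) pairs, to frame level."""
    frames = []
    for phonemes, note_frames in score:
        durations = phoneme_durations(phonemes, note_frames, consonant_frames)
        for p, d in zip(phonemes, durations):
            frames.extend([p] * d)  # repeat each phoneme for its duration
    return frames


score = [(["l", "a"], 8), (["s", "i", "ng"], 12)]
frames = initial_alignment(score, consonant_frames=3)
```

Because each note's durations sum to the note length, the expanded sequence always has exactly as many frames as the score, which is what lets a non-autoregressive decoder produce all output frames in parallel.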
Related papers
- Singing Synthesis: With A Little Help From My Attention (2019) · 7.16
- Adversarially Trained Multi-singer Sequence-to-sequence Singing Synthesizer (2020) · 7.81
- Semi-supervised Learning For Singing Synthesis Timbre (2020) · 3.58
- Forward Attention In Sequence-to-sequence Acoustic Modelling For Speech Synthesis (2018) · 12.10
- Singing Voice Synthesis Based On A Musical Note Position-aware Attention Mechanism (2022) · 0.00
- Leveraging Symmetrical Convolutional Transformer Networks For Speech To Singing Voice Style Transfer (2022) · 5.84
- A Melody-unsupervision Model For Singing Voice Synthesis (2021) · 5.84
- SingGAN: Generative Adversarial Network For High-fidelity Singing Voice Generation (2021) · 10.61