TriAAN-VC: Triple Adaptive Attention Normalization For Any-to-any Voice Conversion
2023 · Hyun Joon Park, Seok Woo Yang, Jin Sob Kim, et al.
Abstract
Voice Conversion (VC) must be achieved while maintaining the content of the source speech and representing the characteristics of the target speaker. The existing methods do not simultaneously satisfy the above two aspects of VC, and their conversion outputs suffer from a trade-off problem between maintaining source contents and target characteristics. In this study, we propose Triple Adaptive Attention Normalization VC (TriAAN-VC), comprising an encoder-decoder and an attention-based adaptive normalization block, that can be applied to non-parallel any-to-any VC. The proposed adaptive normalization block extracts target speaker representations and achieves conversion while minimizing the loss of the source content with siamese loss. We evaluated TriAAN-VC on the VCTK dataset in terms of the maintenance of the source content and target speaker similarity. Experimental results for one-shot VC suggest that TriAAN-VC achieves state-of-the-art performance while mitigating the trade-off pro
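The abstract refers to adaptive normalization as the mechanism that carries target speaker characteristics into the converted speech. As a rough illustration of that general idea only, the sketch below shows plain adaptive instance normalization: each source feature channel is normalized over time, then rescaled with the per-channel statistics of a target utterance. This is not the attention-based block that TriAAN-VC proposes; the function name `adaptive_instance_norm`, the `eps` value, and the toy feature values are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def adaptive_instance_norm(source, target, eps=1e-5):
    """Illustrative sketch of adaptive instance normalization (hypothetical helper).

    source, target: lists of channels, each channel a list of frame values.
    The two inputs must have the same number of channels but may have
    different numbers of frames.
    """
    converted = []
    for src_ch, tgt_ch in zip(source, target):
        # Per-channel statistics computed over the time axis.
        src_mean, src_std = fmean(src_ch), pstdev(src_ch)
        tgt_mean, tgt_std = fmean(tgt_ch), pstdev(tgt_ch)
        # Remove the source statistics, then apply the target statistics.
        converted.append(
            [(x - src_mean) / (src_std + eps) * tgt_std + tgt_mean for x in src_ch]
        )
    return converted


# Toy features (assumed values): 2 channels, 4 source frames, 3 target frames.
source = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 11.0, 11.5]]
target = [[5.0, 7.0, 9.0], [-1.0, 0.0, 1.0]]
converted = adaptive_instance_norm(source, target)
```

The output keeps the frame count and temporal shape of the source while its per-channel mean and standard deviation match those of the target, which is the sense in which normalization statistics can stand in for speaker characteristics.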
Related papers
- AGAIN-VC: A One-shot Voice Conversion Using Activation Guidance And Adaptive Instance Normalization (2020)
- One-shot Voice Conversion By Separating Speaker And Content Representations With Instance Normalization (2019)
- An Adaptive Learning Based Generative Adversarial Network For One-to-one Voice Conversion (2021)
- Assem-VC: Realistic Voice Conversion By Assembling Modern Speech Synthesis Techniques (2021)
- ACVAE-VC: Non-parallel Many-to-many Voice Conversion With Auxiliary Classifier Variational Autoencoder (2018)
- MediumVC: Any-to-any Voice Conversion Using Synthetic Specific-speaker Speeches As Intermedium Features (2021)
- Pureformer-VC: Non-parallel One-shot Voice Conversion With Pure Transformer Blocks And Triplet Discriminative Training (2024)
- ACE-VC: Adaptive And Controllable Voice Conversion Using Explicitly Disentangled Self-supervised Speech Representations (2023)