Improving Robustness Of One-shot Voice Conversion With Deep Discriminative Speaker Encoder
2021 · Hongqiang Du, Lei Xie
Abstract
One-shot voice conversion has received significant attention because it requires only one utterance each from the source speaker and the target speaker. Moreover, neither speaker needs to be seen during training. However, existing one-shot voice conversion approaches are unstable for unseen speakers because the speaker embedding extracted from a single utterance of an unseen speaker is unreliable. In this paper, we propose a deep discriminative speaker encoder that extracts the speaker embedding from one utterance more effectively. Specifically, the speaker encoder first integrates a residual network and a squeeze-and-excitation network to extract discriminative speaker information at the frame level by modeling frame-wise and channel-wise interdependence in the features. An attention mechanism is then introduced to further emphasize speaker-related information by assigning different weights to the frame-level speaker information. Finally, a statistic pooling layer is used to aggregate weighted
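The last two stages described in the abstract, attention weighting over frames followed by statistic pooling, can be illustrated with a minimal sketch. It assumes the pooled statistics are an attention-weighted mean and standard deviation per channel, and it scores each frame with a single linear projection. The abstract specifies neither detail, so the scoring function, the name `attentive_stats_pooling`, and the parameter `score_weights` are illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, score_weights):
    """Pool frame-level features (T x C) into one utterance-level vector (2C).

    frames: list of T frames, each a list of C floats.
    score_weights: hypothetical length-C vector used to score each frame.
    """
    # One scalar score per frame (assumed linear scoring, not from the paper).
    scores = [sum(w * x for w, x in zip(score_weights, frame)) for frame in frames]
    alphas = softmax(scores)  # attention weights, summing to 1 over frames

    num_channels = len(frames[0])
    # Attention-weighted mean of each channel.
    mean = [
        sum(a * frame[c] for a, frame in zip(alphas, frames))
        for c in range(num_channels)
    ]
    # Attention-weighted standard deviation of each channel.
    std = [
        math.sqrt(
            sum(a * (frame[c] - mean[c]) ** 2 for a, frame in zip(alphas, frames))
        )
        for c in range(num_channels)
    ]
    return mean + std  # concatenation [mean, std]


# Three frames with two channels each.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
uniform = attentive_stats_pooling(frames, [0.0, 0.0])   # equal attention
peaked = attentive_stats_pooling(frames, [10.0, 0.0])   # favors the last frame
```

With all-zero scoring weights every frame receives the same attention, so the result reduces to ordinary mean and standard deviation pooling. A large weight on the first channel concentrates attention on the frame where that channel is largest, which pulls the pooled mean toward that frame.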
Related papers
- Residual Speaker Representation For One-shot Voice Conversion (2023) 0.00
- One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation (2021) 8.09
- MAIN-VC: Lightweight Speech Representation Disentanglement For One-shot Voice Conversion (2024) 3.58
- Investigation Of Using Disentangled And Interpretable Representations For One-shot Cross-lingual Voice Conversion (2018) 6.77
- Improvement Speaker Similarity For Zero-shot Any-to-any Voice Conversion Of Whispered And Regular Speech (2024) 4.52
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) 10.97
- Speech Representation Disentanglement With Adversarial Mutual Information Learning For One-shot Voice Conversion (2022) 11.08
- Robust One-shot Singing Voice Conversion (2022) 0.00