Stylebook: Content-dependent Speaking Style Modeling For Any-to-any Voice Conversion Using Only Speech Data
2023 · Hyungseob Lim, Kyungguen Byun, Sunkuk Moon, et al.
Abstract
While many recent any-to-any voice conversion models succeed in transferring some of the target speech's style information to the converted speech, they still lack the ability to faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method to extract rich style information from target utterances and to efficiently transfer it to source speech content without requiring text transcriptions or speaker labels. Our proposed approach introduces an attention mechanism that uses a self-supervised learning (SSL) model to collect the speaking styles of a target speaker, each corresponding to different phonetic content. The styles are represented with a set of embeddings called a stylebook. In the next step, the stylebook is attended with the source speech's phonetic content to determine the final target style for each source content. Finally, content information extracted from the source speech and content-dependent target style embeddings are fed into a dif
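The two attention steps described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the abstract does not specify the attention form, the dimensions, or how style features are obtained, so scaled dot-product attention, the `queries` matrix of content codes, and the names `build_stylebook` and `lookup_style` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_stylebook(queries, target_content, target_style):
    """Step 1 (assumed form): collect one style embedding per content code.

    queries:        (K, d)       hypothetical content-code queries
    target_content: (T_tgt, d)   frame-level content features of the target
    target_style:   (T_tgt, d_s) frame-level style features of the target
    returns:        (K, d_s)     the "stylebook"
    """
    d = queries.shape[-1]
    weights = softmax(queries @ target_content.T / np.sqrt(d), axis=-1)
    return weights @ target_style

def lookup_style(source_content, queries, stylebook):
    """Step 2 (assumed form): pick a target style for each source frame.

    source_content: (T_src, d) frame-level content features of the source
    returns:        (T_src, d_s) content-dependent style embeddings
    """
    d = queries.shape[-1]
    weights = softmax(source_content @ queries.T / np.sqrt(d), axis=-1)
    return weights @ stylebook

# Toy shapes only; random arrays stand in for SSL-derived features.
rng = np.random.default_rng(0)
K, d, d_s, T_tgt, T_src = 8, 16, 4, 50, 30
queries = rng.normal(size=(K, d))
target_content = rng.normal(size=(T_tgt, d))
target_style = rng.normal(size=(T_tgt, d_s))
source_content = rng.normal(size=(T_src, d))

stylebook = build_stylebook(queries, target_content, target_style)
style_per_frame = lookup_style(source_content, queries, stylebook)
print(stylebook.shape, style_per_frame.shape)
```

In this sketch the stylebook has a fixed size `K` regardless of how long the target utterance is, and every source frame receives its own style vector as a weighted mix of stylebook entries. The generator that consumes these embeddings is outside the scope of the sketch.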
Related papers
- Self-supervised Context-aware Style Representation For Expressive Speech Synthesis (2022) 6.34
- Fine-grained Style Modeling, Transfer And Prediction In Text-to-speech Synthesis Via Phone-level Content-style Disentanglement (2020) 9.41
- MSM-VC: High-fidelity Source Style Transfer For Non-parallel Voice Conversion By Multi-scale Style Modeling (2023) 5.84
- Enriching Source Style Transfer In Recognition-synthesis Based Non-parallel Voice Conversion (2021) 9.23
- StyleTTS: A Style-based Generative Model For Natural And Diverse Text-to-speech Synthesis (2022) 10.97
- One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation (2021) 8.09
- Speech-to-speech Translation With Discrete-unit-based Style Transfer (2023) 0.00
- StyleS2ST: Zero-shot Style Transfer For Direct Speech-to-speech Translation (2023) 0.00