Disentangled Feature Learning For Real-time Neural Speech Coding
2022 · Xue Jiang, Xiulian Peng, Yuan Zhang, et al.
Abstract
Recently, end-to-end neural audio/speech coding has shown great potential to outperform traditional signal-analysis-based audio codecs. This is mostly achieved by following the VQ-VAE paradigm, where blind features are learned, vector-quantized and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, more global-like speaker identity and local content features are learned with disentanglement to represent speech. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among different features but also provides the flexibility to do audio editing in embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate its coding efficiency, and we find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech rep
Related papers
- Latent-domain Predictive Neural Speech Coding (2022) · 12.15
- Freecodec: A Disentangled Neural Speech Codec With Fewer Tokens (2024) · 4.52
- Investigating Disentanglement In A Phoneme-level Speech Codec For Prosody Modeling (2024) · 4.52
- Neural Feature Predictor And Discriminative Residual Coding For Low-bitrate Speech Coding (2022) · 6.77
- Many-to-many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder (2021) · 7.81
- ESC: Efficient Speech Coding With Cross-scale Residual Vector Quantized Transformers (2024) · 5.84
- Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations (2020) · 8.09
- Robust Disentangled Variational Speech Representation Learning For Zero-shot Voice Conversion (2022) · 10.97