Spectral Codecs: Improving Non-autoregressive Speech Synthesis With Spectrogram-based Audio Codecs
2024 · Ryan Langman, Ante Jukić, Kunal Dhawan, et al.
Abstract
Historically, most speech models in machine learning have used the mel-spectrogram as a speech representation. Recently, discrete audio tokens produced by neural audio codecs have become a popular alternative speech representation for speech synthesis tasks such as text-to-speech (TTS). However, the data distribution produced by such codecs is too complex for some TTS models to predict, typically requiring large autoregressive models to achieve good quality. Most existing audio codecs use Residual Vector Quantization (RVQ) to compress and reconstruct the time-domain audio signal. We propose a spectral codec which uses Finite Scalar Quantization (FSQ) to compress the mel-spectrogram and reconstruct the time-domain audio signal. A study of objective audio quality metrics and subjective listening tests suggests that our spectral codec has comparable perceptual quality to equivalent audio codecs. We show that FSQ, and the use of spectral speech representations, can both improve the performance o
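To make the FSQ idea in the abstract concrete, here is a minimal NumPy sketch of the quantization step only. It is not the authors' implementation: the function names `fsq_quantize` and `fsq_index`, the level counts, and the restriction to odd level counts are assumptions made for illustration. A trained codec would also use a learned encoder and decoder and a straight-through gradient estimator, all of which are omitted here.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each latent channel to one of a fixed number of integer levels.

    z      : float array of shape (..., d), e.g. encoder outputs per frame
    levels : sequence of d odd integers (assumption: odd counts only, so the
             levels are symmetric around zero and no half-step offset is needed)
    """
    levels = np.asarray(levels)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch only handles odd level counts")
    half = (levels - 1) / 2                 # largest code magnitude per channel
    bounded = np.tanh(z) * half             # squash into (-half, half)
    return np.round(bounded).astype(np.int64)

def fsq_index(codes, levels):
    """Combine per-channel codes into one token index (mixed-radix encoding)."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2      # shift each code into [0, L - 1]
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

# Hypothetical configuration: 4 channels, implicit codebook of 5*5*5*3 = 375 entries.
levels = [5, 5, 5, 3]
frames = np.array([[0.0, 0.0, 0.0, 0.0],
                   [10.0, 10.0, 10.0, 10.0],
                   [-10.0, -10.0, -10.0, -10.0]])
codes = fsq_quantize(frames, levels)
tokens = fsq_index(codes, levels)
```

Unlike RVQ, this scheme has no learned codebook: the set of codes is fixed by the level counts, and each channel is quantized independently of the others.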
Related papers
- FreeCodec: A Disentangled Neural Speech Codec With Fewer Tokens (2024)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems With Discrete Speech Representations (2024)
- Optimizing Neural Speech Codec For Low-bitrate Compression Via Multi-scale Encoding (2024)
- MSR-Codec: A Low-bitrate Multi-stream Residual Codec For High-fidelity Speech Generation With Information Disentanglement (2025)
- HiFi-Codec: Group-residual Vector Quantization For High Fidelity Audio Codec (2023)
- Investigating Neural Audio Codecs For Speech Language Model-based Speech Generation (2024)
- Language-Codec: Bridging Discrete Codec Representations And Speech Language Models (2024)
- Low Bit-rate Wideband Speech Coding: A Deep Generative Model Based Approach (2021)