Investigating Neural Audio Codecs For Speech Language Model-based Speech Generation
2024 · Jiaqi Li, Dongmei Wang, Xiaofei Wang, et al.
Abstract
Neural audio codec tokens serve as the fundamental building blocks for speech language model (SLM)-based speech generation. However, there is no systematic understanding of how the codec system affects the speech generation performance of the SLM. In this work, we examine codec tokens within the SLM framework for speech generation to provide insights for effective codec design. We retrain existing high-performing neural codec models with the same dataset and loss functions to compare their performance in a uniform setting. We integrate codec tokens into two SLM systems: a mask-based parallel speech generation system and an auto-regressive (AR) plus non-auto-regressive (NAR) model-based system. Our findings indicate that better speech reconstruction in codec systems does not guarantee improved speech generation in the SLM. A high-quality codec decoder is crucial for natural speech production in the SLM, while speech intelligibility depends more on the quantization mechanism.
Related papers
- Codec Does Matter: Exploring The Semantic Shortcoming Of Codec For Audio Language Model (2024)
- Language-codec: Bridging Discrete Codec Representations And Speech Language Models (2024)
- Modeling Strategies For Speech Enhancement In The Latent Space Of A Neural Audio Codec (2025)
- Codec-superb @ SLT 2024: A Lightweight Benchmark For Neural Audio Codec Models (2024)
- Analyzing And Mitigating Inconsistency In Discrete Audio Tokens For Neural Codec Language Models (2024)
- Neural Speech And Audio Coding: Modern AI Technology Meets Traditional Codecs (2024)
- Spectral Codecs: Improving Non-autoregressive Speech Synthesis With Spectrogram-based Audio Codecs (2024)
- Speaking From Coarse To Fine: Improving Neural Codec Language Model Via Multi-scale Speech Coding And Generation (2024)