Low Frame-rate Speech Codec: A Codec Designed For Fast High-quality Speech LLM Training And Inference
2024 · Edresson Casanova, Ryan Langman, Paarth Neekhara, et al.
Abstract
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
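The abstract's figures fix the per-frame budget: 1.89 kbps at 21.5 frames per second is about 1890 / 21.5 ≈ 87.9 bits per frame. The Python sketch below illustrates two ideas under stated assumptions. The first is the forward pass of finite scalar quantization (FSQ): each latent dimension is bounded with tanh and rounded to a small fixed number of levels, and the per-dimension positions are packed into one integer token. The second is how the bitrate follows from frame rate, codebook count and codebook size. The helper names `fsq_quantize` and `bitrate_bps` are hypothetical. The configuration of 8 codebooks with FSQ levels [8, 7, 6, 6] (2016 codes each) is an assumption that is not stated in this abstract. It is used here only because it reproduces the reported 1.89 kbps. The quantizer is also a simplified variant: standard FSQ uses slightly different scaling for even level counts and passes gradients through the rounding with a straight-through estimator, and neither detail is modeled here.

```python
import math


def fsq_quantize(z, levels):
    """Simplified FSQ forward pass for one latent vector (hypothetical helper).

    Each dimension is squashed into (-1, 1) with tanh, then rounded to one of
    levels[i] evenly spaced values in [-1, 1]. Returns the quantized vector and
    a single integer token built by mixed-radix packing of the grid positions.
    """
    assert len(z) == len(levels)
    quantized = []
    index = 0
    stride = 1
    for value, n in zip(z, levels):
        bounded = math.tanh(value)
        # Map (-1, 1) onto grid positions 0 .. n-1 and round to the nearest one.
        pos = round((bounded + 1) / 2 * (n - 1))
        # Map the grid position back to a value in [-1, 1].
        quantized.append(2 * pos / (n - 1) - 1)
        # Mixed-radix packing: dimension i contributes pos * prod(levels[:i]).
        index += pos * stride
        stride *= n
    return quantized, index


def bitrate_bps(frame_rate_hz, levels_per_codebook, num_codebooks):
    """Bits per second when every frame emits one token per codebook."""
    codebook_size = math.prod(levels_per_codebook)
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# Assumed configuration (not from the abstract): 8 codebooks, FSQ levels [8, 7, 6, 6].
LEVELS = [8, 7, 6, 6]
NUM_CODEBOOKS = 8
FRAME_RATE = 21.5  # frames per second, as reported in the abstract

q, token = fsq_quantize([0.0, 10.0, -10.0, 0.3], LEVELS)
print("quantized:", q, "token:", token)
print("bitrate (kbps):", bitrate_bps(FRAME_RATE, LEVELS, NUM_CODEBOOKS) / 1000)
```

Under these assumptions, each frame carries 8 × log2(2016) ≈ 87.8 bits, which matches the per-frame budget derived above. An autoregressive model that predicts one frame per step runs about 21.5 steps per second of audio, so its sequence length scales with the codec's frame rate. This is why lowering the frame rate speeds up LLM-based TTS inference.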
Related papers
- LSCodec: Low-bitrate And Speaker-decoupled Discrete Speech Codec (2024) · 0.00
- FlexiCodec: A Dynamic Neural Audio Codec For Low Frame Rates (2025) · 3.38
- SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec For General Sound (2024) · 10.97
- UniAudio 1.5: Large Language Model-driven Audio Codec Is A Few-shot Audio Task Learner (2024) · 0.00
- CodecSlime: Temporal Redundancy Compression Of Neural Speech Codec Via Dynamic Frame Rate (2025) · 0.00
- SoCodec: A Semantic-ordered Multi-stream Speech Codec For Efficient Language Model Based Text-to-speech Synthesis (2024) · 6.34
- FreeCodec: A Disentangled Neural Speech Codec With Fewer Tokens (2024) · 4.52
- MSR-Codec: A Low-bitrate Multi-stream Residual Codec For High-fidelity Speech Generation With Information Disentanglement (2025) · 2.35