HiFi-Codec: Group-residual Vector Quantization for High Fidelity Audio Codec

Abstract

Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encodec model as an intermediate feature to aid TTS tasks. Despite their usefulness, two challenges persist: (1) training these audio codec models can be difficult due to the lack of publicly available training processes and the need for large-scale data and GPUs; (2) achieving good reconstruction performance requires many codebooks, which increases the burden on generation models. In this study, we propose a group-residual vector quantization (GRVQ) technique and use it to develop a novel High Fidelity Audio Codec model, HiFi-Codec, which only requires 4 codebooks. We train al
