DM-Codec: Distilling Multimodal Representations for Speech Tokenization
2024 · Md Mubtasim Ahasan, Md Fahim, Tasnim Mohiuddin, et al.
Abstract
Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided
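The two transcription metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions: WER = (S + D + I) / N and WIL = 1 - H^2 / (N * P), where H, S, D, and I are the hits, substitutions, deletions, and insertions of a minimum-edit word alignment, N is the reference length, and P is the hypothesis length. The function names (`align_counts`, `wer`, `wil`) are hypothetical, and this is not the evaluation code used by the authors, which this excerpt does not describe.

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
    """Return (hits, substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] is the word-level edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to count each edit type.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            hits += 1
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: edits divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    _, subs, dels, ins = align_counts(ref, hyp)
    return (subs + dels + ins) / len(ref)


def wil(reference: str, hypothesis: str) -> float:
    """Word Information Lost: 1 - H^2 / (N * P)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        raise ValueError("reference and hypothesis must each contain at least one word")
    hits, _, _, _ = align_counts(ref, hyp)
    return 1.0 - (hits * hits) / (len(ref) * len(hyp))


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"WIL = {wil(reference, hypothesis):.3f}")
```

For both metrics lower is better. WER is normalized by the reference length only, so it can exceed 1 when the hypothesis contains many insertions, whereas WIL stays within [0, 1].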
Related papers
- dMel: Speech Tokenization Made Simple (2024) · 0.00
- WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling (2024) · 6.22
- RepCodec: A Speech Representation Codec for Speech Tokenization (2023) · 8.82
- Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing (2024) · 0.00
- What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study (2025) · 0.00
- ALMTokenizer: A Low-Bitrate and Semantic-Rich Audio Codec Tokenizer for Audio Language Modeling (2025) · 0.00
- Language-Codec: Bridging Discrete Codec Representations and Speech Language Models (2024) · 4.64
- Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (2024) · 15.02