dMel: Speech Tokenization Made Simple
2024 · Richard He Bai, Tatiana Likhomanenko, Ruixiang Zhang, et al.
Abstract
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressors introduce additional complexity and computational cost, and often fail on out-of-domain audio signals. In this work, we introduce a novel speech representation (dMel) that discretizes mel-filterbank channels into intensity bins, creating a simpler yet more effective representation compared to existing speech tokenization methods. Our approach demonstrates superior performance in preserving audio content and robustness to out-of-domain data, and offers a training-free, natural, and streamable representation. To address the high-dimensional nature of log-mel spectrograms, we propose an efficient parallel enco
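The core idea in the abstract, mapping each mel-filterbank channel value to an intensity bin, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the choice of 16 uniform bins, the use of the per-input minimum and maximum as the bin range, and the random stand-in for a log-mel spectrogram are all assumptions made for illustration.

```python
import numpy as np


def discretize_logmel(logmel, num_bins=16, lo=None, hi=None):
    """Map each log-mel value to an integer bin index in [0, num_bins - 1].

    Assumption: bins are uniform between `lo` and `hi`; if these are not
    given, the minimum and maximum of the input are used.
    """
    logmel = np.asarray(logmel, dtype=np.float64)
    lo = float(logmel.min()) if lo is None else lo
    hi = float(logmel.max()) if hi is None else hi
    # num_bins + 1 evenly spaced edges; the interior edges separate the bins.
    edges = np.linspace(lo, hi, num_bins + 1)
    # Values outside [lo, hi] fall into the first or last bin.
    codes = np.digitize(logmel, edges[1:-1])
    return codes.astype(np.int64), edges


def reconstruct_logmel(codes, edges):
    """Replace each bin index with the center value of its bin."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[codes]


# Hypothetical input: 100 frames x 80 mel channels of random values
# standing in for a real log-mel spectrogram.
rng = np.random.default_rng(0)
fake_logmel = rng.normal(size=(100, 80))

codes, edges = discretize_logmel(fake_logmel, num_bins=16)
recon = reconstruct_logmel(codes, edges)
```

Because no parameters are learned, the mapping is training-free, and each frame is quantized independently of the others, which is compatible with streaming. With uniform bins, the reconstruction error for in-range values is at most half a bin width.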
Related papers
- Discrete Audio Representation As An Alternative To Mel-spectrograms For Speaker And Speech Recognition (2023) 8.60
- DM-Codec: Distilling Multimodal Representations For Speech Tokenization (2024) 3.53
- Autoregressive Speech Synthesis Without Vector Quantization (2024) 0.00
- SELM: Speech Enhancement Using Discrete Tokens And Language Models (2023) 11.19
- ALMTokenizer: A Low-bitrate And Semantic-rich Audio Codec Tokenizer For Audio Language Modeling (2025) 0.00
- WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer For Audio Language Modeling (2024) 6.22
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) 0.00
- DASH: Dynamic Audio-driven Semantic Chunking For Efficient Omnimodal Token Compression (2026) 2.35