DASB - Discrete Audio And Speech Benchmark
2024 · Pooneh Mousavi, Jarod Duret, Darius Petermann, et al.
Abstract
Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete toke
Related papers
- Discrete Audio Representation As An Alternative To Mel-spectrograms For Speaker And Speech Recognition (2023)
- Acoustic BPE For Speech Generation With Discrete Tokens (2023)
- dMel: Speech Tokenization Made Simple (2024)
- Analyzing And Mitigating Inconsistency In Discrete Audio Tokens For Neural Codec Language Models (2024)
- BEST-STD2.0: Balanced And Efficient Speech Tokenizer For Spoken Term Detection (2025)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems With Discrete Speech Representations (2024)
- DM-Codec: Distilling Multimodal Representations For Speech Tokenization (2024)
- The X-LANCE Technical Report For Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge (2024)