Llama-Mimi: Exploring the Limits of Flattened Speech Language Modeling
2025 · Issa Sugiura, Shuhei Kurita, Yusuke Oda, et al.
Abstract
Speech Language Models (SpeechLMs) model tokenized speech to capture both semantic and acoustic information. When neural audio codecs based on Residual Vector Quantization (RVQ) are used as audio tokenizers, they produce multiple discrete tokens per time step, yielding inherently multi-level representations. To process these multi-level tokens together, prior work typically adopts hierarchical architectures to capture this structure. In contrast, recent progress in NLP has progressively reduced architectural inductive biases, moving toward simpler and more scalable single-Transformer architectures. In this work, we propose Llama-Mimi, which flattens multi-level RVQ tokens produced by the Mimi neural audio codec into a single sequence and models them autoregressively with a Transformer decoder. We show that Llama-Mimi outperforms a CSM-based hierarchical model on most tasks and achieves the best performance on acoustic consistency. Our models, code, and speech samples are publicly avail
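The flattening step described in the abstract can be illustrated with a minimal sketch. It assumes frame-major ordering (all levels of one time step, then the next time step) and a separate vocabulary range per RVQ level. Both choices, along with the function names `flatten_rvq` and `unflatten_rvq` and the toy codebook size, are illustrative assumptions and are not taken from the paper or from the Mimi codec's API.

```python
def flatten_rvq(codes, codebook_size):
    """Flatten per-frame RVQ codes into one token sequence.

    codes: list of frames, each a list of Q integer codes (one per RVQ level).
    Assumption: level q is shifted by q * codebook_size so that tokens from
    different levels never collide in the shared vocabulary.
    """
    flat = []
    for frame in codes:
        for level, code in enumerate(frame):
            if not 0 <= code < codebook_size:
                raise ValueError("code out of range for codebook")
            flat.append(level * codebook_size + code)
    return flat


def unflatten_rvq(flat, num_levels, codebook_size):
    """Invert flatten_rvq: recover the (frames x levels) code grid."""
    if len(flat) % num_levels != 0:
        raise ValueError("sequence length is not a multiple of num_levels")
    frames = []
    for start in range(0, len(flat), num_levels):
        chunk = flat[start:start + num_levels]
        frames.append([tok - level * codebook_size
                       for level, tok in enumerate(chunk)])
    return frames


# Toy example: 2 time steps, 3 RVQ levels, codebook of 8 entries per level.
codes = [[5, 1, 7], [2, 0, 3]]
flat = flatten_rvq(codes, codebook_size=8)
print(flat)  # [5, 9, 23, 2, 8, 19]
```

Under these assumptions the flattened sequence is Q times longer than the frame sequence, which is the cost a single-Transformer model accepts in exchange for dropping the hierarchical architecture.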
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners (2024)
- Transducer-Llama: Integrating LLMs into Streamable Transducer-Based Speech Recognition (2024)
- AudioLM: A Language Modeling Approach to Audio Generation (2022)
- Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing (2024)
- Llasa: Scaling Train-Time and Inference-Time Compute for Llama-Based Speech Synthesis (2025)
- AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs (2023)
- MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-Supervised Learning Model (2024)
- TacoLM: Gated Attention Equipped Codec Language Model Are Efficient Zero-Shot Text to Speech Synthesizers (2024)