Frame-stacked Local Transformers For Efficient Multi-codebook Speech Generation
2025 · Roy Fejgin, Paarth Neekhara, Xuesong Yang, et al.
Abstract
Speech generation models based on large language models (LLMs) typically operate on discrete acoustic codes, which differ fundamentally from text tokens due to their multicodebook structure. At each timestep, models must predict N codebook entries jointly, introducing dependencies that challenge simple parallel prediction approaches. Parallel prediction assumes independence among codebooks, yielding efficient decoding but often at the cost of reduced fidelity. To address this, hierarchical strategies employ a local transformer (LT) to refine predictions and capture intra-timestep dependencies. In this work, we systematically investigate two LT architectures: an autoregressive transformer that generates codebooks sequentially, and a MaskGIT-based transformer that performs iterative masked prediction. Both designs further enable frame stacking, where the primary transformer predicts multiple frames jointly, and the LT decodes their codebooks, offering improvements in speed without compro
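To make the two local-transformer decoding styles from the abstract concrete, here is a minimal toy sketch. It is not the authors' implementation: `toy_scores` is a hypothetical random stand-in for the local transformer, the sizes and the `MASK` id are arbitrary assumptions, and the unmasking schedule is a simple even split rather than whatever schedule the paper uses. Frame stacking is modeled only as the local decoder filling `frames * N_CODEBOOKS` positions per primary-transformer step.

```python
import random

N_CODEBOOKS = 4    # assumed number of codebooks per frame
CODEBOOK_SIZE = 8  # assumed number of entries per codebook
MASK = -1          # hypothetical id marking a not-yet-decoded position


def toy_scores(codes, position, rng):
    """Hypothetical stand-in for the local transformer.

    A real model would condition on `codes` and `position`; this toy
    ignores them and returns one random score per codebook entry.
    """
    return [rng.random() for _ in range(CODEBOOK_SIZE)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def decode_autoregressive(n_positions, rng):
    """Sequential decoding: one model call per codebook position."""
    codes = []
    for position in range(n_positions):
        codes.append(argmax(toy_scores(codes, position, rng)))
    return codes


def decode_maskgit(n_positions, rng, steps=3):
    """Iterative masked decoding: each step proposes a code for every
    masked position and commits only the highest-scoring proposals."""
    codes = [MASK] * n_positions
    for step in range(steps):
        masked = [p for p, c in enumerate(codes) if c == MASK]
        if not masked:
            break
        proposals = {}
        for p in masked:
            scores = toy_scores(codes, p, rng)
            best = argmax(scores)
            proposals[p] = (scores[best], best)
        # Even split of the remaining positions over the remaining steps
        # (ceil division), so the last step commits everything left.
        n_commit = -(-len(masked) // (steps - step))
        ranked = sorted(masked, key=lambda p: proposals[p][0], reverse=True)
        for p in ranked[:n_commit]:
            codes[p] = proposals[p][1]
    return codes


def unstack(codes, n_codebooks=N_CODEBOOKS):
    """Split a flat list of stacked-frame codes back into per-frame lists."""
    return [codes[i:i + n_codebooks] for i in range(0, len(codes), n_codebooks)]


if __name__ == "__main__":
    rng = random.Random(0)
    frames = 2  # assumed frame-stacking factor
    print(unstack(decode_autoregressive(frames * N_CODEBOOKS, rng)))
    print(unstack(decode_maskgit(frames * N_CODEBOOKS, rng, steps=3)))
```

In this toy, the autoregressive decoder makes one pass per position (8 passes for 2 stacked frames of 4 codebooks), while the masked decoder makes a fixed number of passes (`steps`) regardless of how many positions are stacked, which illustrates the kind of speed and fidelity trade-off the abstract refers to.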
Related papers
- Generative Pre-trained Speech Language Model With Efficient Hierarchical Transformer (2024) 5.96
- Latent Speech-text Transformer (2025) 3.04
- Study Of Lightweight Transformer Architectures For Single-channel Speech Enhancement (2025) 3.58
- Paraformer: Fast And Accurate Parallel Transformer For Non-autoregressive End-to-end Speech Recognition (2022) 15.10
- Mixture-of-Transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024) 0.00
- MaskGCT: Zero-shot Text-to-speech With Masked Generative Codec Transformer (2024) 7.98
- Spark-TTS: An Efficient LLM-based Text-to-speech Model With Single-stream Decoupled Speech Tokens (2025) 8.08
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) 0.00