Speculative Decoding And Beyond: An In-depth Survey Of Techniques
2025 · Yunhai Hu, Zining Liu, Zhenyuan Dong, et al.
Abstract
Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated. This survey presents a comprehensive taxonomy of generation-refinement frameworks, analyzing methods across autoregressive sequence tasks. We categorize methods based on their generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (including single-pass verification and iterative approaches). Through systematic analysis of both algorithmic innovations and system-level implementations, we examine deployment strategies across computing environments and explore applications spanning text, images, and speech generation. This systematic examination of both theoretical framewo
Related papers
- Iterative Autoregression: A Novel Trick To Improve Your Low-latency Speech Enhancement Model (2022) 5.24
- Audio Generation Through Score-based Generative Modeling: Design Principles And Implementation (2025) 1.91
- Auto-regressive Vs Flow-matching: A Comparative Study Of Modeling Paradigms For Text-to-music Generation (2025) 0.00
- Fast And High-quality Auto-regressive Speech Synthesis Via Speculative Decoding (2024) 5.24
- Diffar: Denoising Diffusion Autoregressive Model For Raw Speech Waveform Generation (2023) 0.00
- Principled Coarse-grained Acceptance For Speculative Decoding In Speech (2025) 0.00
- Advances In GRPO For Generation Models: A Survey (2026) 0.00
- Parallel Synthesis For Autoregressive Speech Generation (2022) 4.52