Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

Abstract

As demand for Large Language Models (LLMs) and AI agents grows rapidly, optimizing systems for efficient LLM inference becomes critical. While significant effort has gone into system-level engineering, LLM inference has received little study from a mathematical modeling and queueing perspective. In this paper, we develop the queueing fundamentals for LLM inference, focusing on the throughput of LLM inference systems. We prove that a large class of "work-conserving" scheduling algorithms achieves maximum throughput for both individual requests and AI-agent workloads with directed acyclic graph (DAG) and fork-join routing topologies, establishing work conservation as a key design principle for practitioners. Technically, we develop a fluid-limit framework for multi-class batched processing networks under $K$-FCFS scheduling, which may be of independent interest. Evaluations of real-world systems confirm that Orca and Sarathi-Serve are throughput-optimal, which should reassure practitioners, whereas FasterTransformer and vanilla vLLM are not maximally stable and should be used with caution. Our analysis also reveals how constraints such as batch size limits and cyclic routing topologies complicate the throughput picture, pointing to rich open questions at the intersection of queueing theory and LLM system design.
