AI

SGLang: Efficient Execution of Structured Language Model Programs

Large language models (LLMs) are increasingly utilized for complex tasks requiring multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems for programming and executing these applications are lacking. SGLang, a newly introduced system, aims to address this by providing efficient execution of complex language model programs. SGLang comprises a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control, while the runtime accelerates execution through novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments demonstrate that SGLang achieves up to 6.4× higher throughput compared to state-of-the-art inference systems on various large language and multimodal models, tackling tasks such as agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat.

Recent advancements in LLM capabilities have expanded their utility, enabling them to handle a wider range of general tasks and function as autonomous agents. In these applications, LLMs engage in multi-round planning, reasoning, and interaction with external environments. This is facilitated through tool usage, multiple input modalities, and various prompting techniques, such as few-shot learning, self-consistency, skeleton-of-thought, and tree-of-thought. These new use cases necessitate multiple, often dependent, LLM generation calls, indicating a trend of using multi-call structures to complete complex tasks.

This shift marks a transition from simple chatting to a more sophisticated programmatic usage of LLMs, where programs schedule and control the generation processes of LLMs. These programs are referred to as “Language Model Programs” (LM Programs). Advanced prompting techniques and agentic workflows fall within the scope of LM programs. There are two common properties of LM programs: (1) LM programs typically involve multiple LLM calls interspersed with control flow to complete complex tasks and enhance overall quality. (2) LM programs receive structured inputs and produce structured outputs, enabling the composition of LM programs and integration into existing software systems. 

In this article, we will be taking a deeper dive into the SGLang framework, exploring its architecture, analyzing its performance, and comparing it against state of the art frameworks. So let’s get started. 

Despite the widespread use of LM programs, current systems for expressing and executing them remain inefficient. SGLang identifies two primary challenges associated with the efficient use of LM programs:

  • Programming Complexity: Developing LM programs is tedious and difficult due to the non-deterministic nature of LLMs. This involves extensive string manipulation, experimental tuning of prompts, brittle output parsing, handling multiple input modalities, and implementing parallelism mechanisms. This complexity significantly reduces the readability of even simple programs.
  • Execution Inefficiency: Executing LM programs is inefficient due to redundant computation and memory usage. State-of-the-art inference engines, optimized to reduce latency and improve throughput, lack direct knowledge of the workload, resulting in significant inefficiencies. A notable example is the reuse of the Key-Value (KV) cache, which consists of reusable intermediate tensors essential for generative inference. Current systems lack effective mechanisms to facilitate KV cache reuse across multiple LLM calls that share a common prefix, leading to unnecessary computations and wasted memory. Additionally, constrained decoding for structured outputs, such as JSON mode, is suboptimal as existing systems only decode one token at a time.

To address these challenges, SGLang introduces a Structured Generation Language for LLMs. The core idea is to systematically exploit the multi-call structure in LM programs for efficient execution. As shown in the following figure, SGLang has two parts: a front-end language and a back-end runtime. 

The front-end simplifies the programming of LM programs, and the runtime accelerates their execution. These parts can work together for better performance or function independently. 

SGLang is a domain-specific language embedded in Python, providing primitives for generation (e.g., extend, gen, select) and parallelism control (e.g., fork, join). It is compatible with Python’s control flow and libraries, allowing users to develop advanced prompting workflows easily with native Python syntax. SGLang includes an interpreter and a compiler. The interpreter manages the prompt state as a stream and submits primitive operations to the stream for asynchronous execution, ensuring proper control over synchronization and intra-program parallelism. Additionally, SGLang programs can be traced and compiled for further optimizations.The runtime of SGLang proposes several novel optimizations to accelerate the execution of LM programs:

  • RadixAttention: This technique enables the automatic reuse of the KV cache across multiple generation calls. In existing inference engines, the KV cache of a request is discarded after processing, preventing reuse across multiple calls and slowing execution. SGLang maintains an LRU cache of the KV cache within a radix tree, managing the KV cache as a traditional cache and using the radix tree for efficient matching, insertion, and eviction. This allows the runtime to handle various reuse patterns efficiently.
  • Compressed Finite State Machine: This technique enables faster constrained decoding for structured outputs. Existing systems follow constraints only for the next token, making them able to decode one token at a time. Instead, SGLang analyzes the constraints and builds a compressed finite-state machine to represent them, compressing a multi-token path into a single-step path whenever possible, allowing the decoding of multiple tokens at once for faster speed.
  • API Speculative Execution: For API-only models like OpenAI’s GPT-4, SGLang introduces API speculative execution to optimize multi-call programs.
See also  DeepMind’s Michelangelo Benchmark: Revealing the Limits of Long-Context LLMs

Using SGLang, various LLM applications were implemented, including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, multi-turn chat, and multi-modality processing. The performance was tested on models including Llama-7B/70B, Mistral-8x7B, LLaVA-v1.5-7B (image), and LLaVA-NeXT-34B (video) on NVIDIA A10G and A100 GPUs. Experimental results show that SGLang achieves up to 6.4× higher throughput across a wide range of workloads, models, and hardware setups, compared to existing programming and inference systems, including Guidance, vLLM, and LMQL. 

SGLang: Programming Model and Methodology

The SGLang programming model is introduced through a running example, describing its language primitives and execution modes, and outlining runtime optimization opportunities. This model simplifies tedious operations in multi-call workflows (e.g., string manipulation, API calling, constraint specification, parallelism) by providing flexible and composable primitives. SGLang is a domain-specific language embedded in Python. The following figure shows a program that evaluates an essay about an image using the branch-solve-merge prompting method. 

The function multi_dimensional_judge takes three arguments: `s`, `path`, and `essay`. s manages the prompt state, path is the image file path, and essay is the essay text. New strings and SGLang primitives can be appended to the state s for execution using the += operator. First, the function adds the image and essay to the prompt. It then checks if the essay is related to the image using select, storing the result in s[“related”]. If related, the prompt is forked into three copies for parallel evaluation from different dimensions, using gen to store results in f[“judgment”]. Next, it merges the judgments, generates a summary, and assigns a letter grade. Finally, it returns the results in JSON format, following a schema defined by a regular expression constraint regex. SGLang greatly simplifies this program, as an equivalent program using an OpenAI API-like interface would take 2.1× as many lines of code due to manual string manipulation and parallelism control.

SGLang provides primitives for controlling prompt state, generation, and parallelism, which can be used with Python syntax and libraries. Here are the primitives:

gen: Calls a model to generate and stores the results in a variable with the name specified in its first argument. It supports a `regex` argument to constrain the output to follow a grammar defined by a regular expression (e.g., a JSON schema).

  • select: Calls a model to choose the highest probability option from a list.
  • += or extend: Appends a string to the prompt.
  • [variable_name]: Fetches the results of a generation.
  • fork: Creates parallel forks of the prompt state.
  • join: Rejoins the prompt state.
  • image and video: Take in image and video inputs.
See also  What is ChatGPT Canvas? The Alternative to Claude Artifacts

The simplest way to execute an SGLang program is through an interpreter, where a prompt is treated as an asynchronous stream. Primitives like extend, gen, and select are submitted to the stream for asynchronous execution. These non-blocking calls allow Python code to continue running without waiting for the generation to finish, similar to launching CUDA kernels asynchronously. Each prompt is managed by a stream executor in a background thread, enabling intra-program parallelism. Fetching generation results will block until they are ready, ensuring correct synchronization. Alternatively, SGLang programs can be compiled as computational graphs and executed with a graph executor, allowing for more optimizations. This paper uses interpreter mode by default and discusses compiler mode results in Appendix D. SGLang supports open-weight models with its own SGLang Runtime (SRT), as well as API models such as OpenAI and Anthropic models.

Programming systems for LLMs can be classified as high-level (e.g., LangChain, DSPy) and low-level (e.g., LMQL, Guidance, SGLang). High-level systems provide predefined or auto-generated prompts, such as DSPy’s prompt optimizer. Low-level systems typically do not alter prompts but allow direct manipulation of prompts and primitives. SGLang is a low-level system similar to LMQL and Guidance. The following table compares their features.

SGLang focuses more on runtime efficiency and comes with its own co-designed runtime, allowing for novel optimizations. High-level languages (e.g., DSPy) can be compiled to low-level languages (e.g., SGLang). The integration of SGLang as a backend in DSPy for better runtime efficiency is demonstrated later.

The above example illustrates RadixAttention operations with an LRU eviction policy across nine time points, showcasing the dynamic evolution of the radix tree in response to various requests. These requests include two chat sessions, a batch of few-shot learning inquiries, and self-consistency sampling. Each tree edge carries a label denoting a substring or a sequence of tokens. The nodes are color-coded to reflect different states: green for newly added nodes, blue for cached nodes accessed during the time point, and red for nodes that have been evicted.

Step 1: The radix tree is initially empty.

Step 2: The server processes an incoming user message “Hello” and responds with the LLM output “Hi”. The system prompt “You are a helpful assistant”, the user message “Hello!”, and the LLM reply “Hi!” are consolidated into the tree as a single edge linked to a new node.

Step 3: A new prompt arrives, and the server finds the prefix of the prompt (i.e., the first turn of the conversation) in the radix tree and reuses its KV cache. The new turn is appended to the tree as a new node.

Step 4: A new chat session begins. The node from Step 3 is split into two nodes to allow the two chat sessions to share the system prompt.

Step 5: The second chat session continues. However, due to memory limits, a node from Step 4 must be evicted. The new turn is appended after the remaining node from Step 4.

Step 6: The server receives a few-shot learning query, processes it, and inserts it into the tree. The root node is split because the new query does not share any prefix with existing nodes.

Step 7: The server receives a batch of additional few-shot learning queries. These queries share the same set of few-shot examples, so a node from Step 6 is split to enable sharing.

See also  Cooking Up Narrative Consistency for Long Video Generation

Step 8: The server receives a new message from the first chat session. It evicts all nodes from the second chat session as they are least recently used.

Step 9: The server receives a request to sample more answers for the questions in a node from Step 8, likely for self-consistency prompting. To make space for these requests, multiple nodes are evicted.

This example demonstrates how RadixAttention handles the dynamic allocation and eviction of nodes in response to different types of requests, ensuring efficient KV cache reuse and memory management.

SGLang: Evaluation and Results

Results on Open-Weight Models

The latency and throughput results are shown in the following figures. SGLang improves throughput by up to 6.4× and reduces latency by up to 3.7×. These improvements result from KV cache reuse, the exploitation of parallelism within a single program, and faster constrained decoding. 

On these benchmarks, the cache hit rate ranges from 50% to 99%. Figure 13 (Appendix) lists the achieved and optimal cache hit rates for all of them, showing that SGLang’s cache-aware scheduling approaches 96% of the optimal hit rate on average.

Results on Larger Models with Tensor Parallelism

Larger models, Mixtral-8x7B and Llama-70B, were tested with tensor parallelism on the same set of benchmarks, and the results are reported in the following figure. The speedup on larger models shows a trend similar to that observed on smaller models, indicating that SGLang’s optimization generalizes well to larger models. Guidance and LMQL were omitted due to the lack of efficient implementations of tensor parallelism.

 Results on Multi-Modal Models

SGLang has native support for multi-modal models with the image and video primitives. The optimizations in this paper are compatible with multi-modal models. For RadixAttention, the hash of the input images is computed and used as the key in the radix tree, allowing the reuse of the KV cache of the image tokens from the same image. LLaVA-v1.5-7B (image) was run on llava-bench-in-the-wild and LLaVA-NeXT-34B (video) on ActivityNet. Because these models are not well supported by other baseline systems, the model authors’ original implementation in Hugging Face Transformers was used as the baseline. As shown in the following table, SGLang provides throughput up to 6× higher on these benchmarks. In llava-bench-in-the-wild, multiple questions about the same image were handled, and SGLang runtime reused the KV cache in this case.

Production Deployment

SGLang has been deployed in Chatbot Arena to serve open-weight models. Due to low traffic for some models, only one SGLang worker serves each. After one month, a 52.4% RadixAttention cache hit rate for LLaVA-Next-34B and 74.1% for Vicuna-33B was observed. Cache hits came from common system messages, frequently reused example images, and multi-turn chat histories. This reduced first-token latency by an average of 1.7× for Vicuna-33B.

Final Thoughts

In this article, we have talked about SGLang, a newly introduced system, aims to address this by providing efficient execution of complex language model programs. SGLang comprises a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control, while the runtime accelerates execution through novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments demonstrate that SGLang achieves up to 6.4× higher throughput compared to state-of-the-art inference systems on various large language and multimodal models, tackling tasks such as agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat.

Source link

Related Articles

Back to top button