NVIDIA's Dynamo Planner Introduces SLO-Focused Automation for Multi-Node LLM Inference
In a significant advancement, Microsoft and NVIDIA have unveiled Part 2 of their collaboration on large language model (LLM) inference performance on Azure Kubernetes Service (AKS). The initial announcement set an ambitious goal of processing 1.2 million tokens per second across distributed GPU systems. The latest update shifts the emphasis toward accelerating developer workflows and improving operational efficiency through automated resource planning and dynamic scaling capabilities.
At the heart of these new features are two integrated tools: the Dynamo Planner Profiler and the SLO-based Dynamo Planner. These components tackle the "rate matching" challenge common in disaggregated serving environments. Disaggregated serving splits an inference workload into distinct parts, separating prefill operations that process the input context from decode operations that generate output tokens, with each phase running on its own GPU pool. Rate matching is the task of balancing capacity between those pools so that neither side starves or idles the other. Without effective tooling, development teams often spend considerable time working out the right allocation of GPUs across the two phases.
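To make the rate-matching problem concrete, the sketch below estimates how many prefill and decode workers are needed so each pool keeps pace with incoming token demand. The function name and per-worker capacity figures are illustrative assumptions, not Dynamo APIs; in practice such numbers come from profiling a specific model on specific GPUs.

```python
import math

def match_rates(request_rate, avg_prompt_tokens, avg_output_tokens,
                prefill_tok_per_s_per_worker, decode_tok_per_s_per_worker):
    """Estimate worker counts so prefill and decode keep pace with demand.

    All per-worker throughput figures are hypothetical; real values would
    come from profiling a given model and GPU type.
    """
    prefill_demand = request_rate * avg_prompt_tokens   # input tokens/s
    decode_demand = request_rate * avg_output_tokens    # output tokens/s
    prefill_workers = math.ceil(prefill_demand / prefill_tok_per_s_per_worker)
    decode_workers = math.ceil(decode_demand / decode_tok_per_s_per_worker)
    return prefill_workers, decode_workers

# Example: 50 req/s with long prompts but short answers -> prefill-heavy mix.
print(match_rates(50, 2000, 150, 40000, 5000))  # -> (3, 2)
```

The point of the exercise: the two pools need different worker counts for the same traffic, and the right ratio shifts whenever prompt or output lengths change, which is what makes manual tuning so time-consuming.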
The Dynamo Planner Profiler is a simulation tool used before deployment. It automates the search for optimal configuration settings, allowing developers to bypass the tedious process of manually testing various parallelization strategies and GPU counts, which can consume hours of valuable GPU resources. Instead, developers outline their requirements in a DynamoGraphDeploymentRequest (DGDR) manifest. The profiler then conducts an automated exploration of the configuration space, evaluating various tensor parallelism sizes for both the prefill and decode stages. This approach identifies settings that maximize throughput while adhering to established latency constraints.
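As a rough illustration of the workflow, a requirements manifest of this kind might look like the fragment below. Note that the field names and values here are hypothetical placeholders to convey the idea of declaring SLOs and a GPU budget; they do not reproduce the actual DGDR schema, which is defined by the Dynamo project.

```yaml
# Illustrative sketch only: field names are hypothetical and do not
# reproduce the exact DynamoGraphDeploymentRequest (DGDR) schema.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: airline-assistant
spec:
  model: Qwen3-32B-FP8
  slo:
    ttft: 500ms        # Time to First Token target
    itl: 30ms          # Inter-Token Latency target
  gpuBudget: 8         # upper bound for the profiler's search
```

The developer states the latency targets and resource bounds; the profiler, not the developer, decides the parallelism layout and worker split.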
One impressive feature of the profiler is its AI Configurator mode, which can simulate performance in roughly 20 to 30 seconds using previously gathered performance metrics. This enables teams to iterate quickly on configurations before committing physical GPU resources. The resulting output is a finely tuned setup that maximizes what industry professionals call "Goodput": the highest throughput achievable while staying within the specified limits for Time to First Token (TTFT) and Inter-Token Latency (ITL).
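The search the profiler performs can be sketched as goodput maximization over a table of measured configurations: discard any setup that violates either latency SLO, then take the highest-throughput survivor. The profile numbers below are invented for illustration; real ones come from the profiler's measurements or its AI Configurator estimates.

```python
# Hypothetical profiled results: (prefill_tp, decode_tp) ->
# (throughput tok/s, ttft_ms, itl_ms). Real data comes from profiling.
PROFILE = {
    (1, 1): (4000, 420, 38),
    (2, 1): (5200, 300, 36),
    (2, 2): (6800, 310, 24),
    (4, 2): (7400, 180, 25),
    (4, 4): (7100, 190, 18),
}

def best_goodput(profile, ttft_slo_ms, itl_slo_ms):
    """Return the config with the highest throughput that meets both SLOs."""
    feasible = {cfg: m for cfg, m in profile.items()
                if m[1] <= ttft_slo_ms and m[2] <= itl_slo_ms}
    if not feasible:
        return None  # no configuration satisfies the latency targets
    return max(feasible, key=lambda cfg: feasible[cfg][0])

print(best_goodput(PROFILE, 500, 30))  # -> (4, 2): 7400 tok/s within both SLOs
```

Note that the raw-throughput winner is not automatically the goodput winner: a configuration with higher tokens per second is rejected if it misses either latency bound.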
Once the system is operational, the SLO-based Dynamo Planner takes over as the runtime orchestration engine. Unlike conventional load balancers, the Planner is LLM-aware: it actively monitors the state of the cluster, tracking vital metrics such as key-value (KV) cache load within the decode pool and the depth of the prefill queue. Leveraging the performance parameters established by the profiler, the Planner dynamically adjusts the number of prefill and decode workers so that service level objectives continue to be met as traffic patterns fluctuate.
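A toy version of such an SLO-driven scaling rule is sketched below: grow whichever pool's signal indicates it is falling behind. The thresholds and function are illustrative assumptions, not the Planner's actual policy, which draws on the profiler's performance model.

```python
def plan_scaling(prefill_queue_depth, kv_cache_utilization,
                 prefill_workers, decode_workers,
                 queue_high=10, kv_high=0.9):
    """Toy scaling rule: add capacity to whichever pool is falling behind.

    Thresholds are illustrative; a real planner would derive them from
    profiled performance data and the configured SLOs.
    """
    if prefill_queue_depth > queue_high:
        prefill_workers += 1   # queue backing up -> more prefill capacity
    if kv_cache_utilization > kv_high:
        decode_workers += 1    # KV cache pressure -> more decode capacity
    return prefill_workers, decode_workers

# A burst of long prompts backs up the prefill queue:
print(plan_scaling(prefill_queue_depth=25, kv_cache_utilization=0.6,
                   prefill_workers=1, decode_workers=1))  # -> (2, 1)
```

Because the two pools are scaled independently, a prefill-heavy surge adds prefill workers without paying for decode capacity the workload does not need.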
To illustrate these capabilities, the announcement presents a scenario involving an airline assistant application. A Qwen3-32B-FP8 model powers a mobile app for an airline under strict service level agreements: a maximum Time to First Token of 500 milliseconds and an Inter-Token Latency of 30 milliseconds. Under normal conditions, answering brief passenger queries, the system runs a single prefill worker and a single decode worker. When a weather-related disruption prompts 200 users to submit complex rerouting requests, the Planner detects the surge in demand and scales up to two prefill workers while keeping one decode worker. According to the announcement, the additional worker comes online within minutes, allowing the system to sustain its latency targets despite the increased workload.
This recent release builds upon the framework introduced in the original Dynamo announcement made in December 2024. In that article, Azure and NVIDIA discussed how Dynamo’s architecture efficiently distributes compute-intensive and memory-bound tasks across multiple GPUs. This strategic separation allows teams to fine-tune each phase of the process independently, ensuring that resources align with the specific needs of the workload. For instance, in an e-commerce application, the prefill task might entail processing thousands of tokens, while the decode task may only require generating concise descriptions.
The transition from manual configuration to automated, SLO-driven resource management illustrates how organizations can effectively manage large language model deployments on Kubernetes. The Planner’s components furnish essential tools that translate latency requirements into informed GPU allocation and scaling decisions. This technological evolution aims to alleviate the operational challenges associated with running disaggregated inference architectures, making it easier for businesses to manage complex multi-node GPU setups while consistently meeting service level targets amid varying traffic patterns.