In this guide, you'll learn how to load test vLLM inference servers using LoadForge and Locust. We'll cover how vLLM behaves under load, how to write practical Locust scripts against real vLLM-compatible endpoints, and how to analyze throughput, latency, and failure patterns.

How to design a latency-testing protocol that exposes batch, concurrency, and tail-percentile behavior under realistic AI inference load. The easiest latency test to run on an AI inference service is the worst latency test for predicting production behavior: send one request at a time, wait for the...

NVIDIA's GenAI-Perf is an open-source benchmarking tool that measures LLM inference performance, including throughput, latency, and token-level metrics. Key metrics for evaluating LLM performance include time to first token (TTFT), end-to-end request latency, intertoken latency (ITL), tokens...

Combining --request-rate with --concurrency gives you precise control over both request timing and the maximum number of concurrent connections.

vLLM has become a popular choice for serving large language models because of its high-throughput architecture, efficient KV cache management, and continuous...

A load test simulates real-world usage of Model Serving endpoints to ensure they meet your production requirements, such as latency or requests per second.
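As a rough illustration of the latency metrics named above (TTFT, ITL, end-to-end latency, and tail percentiles), the sketch below derives them from per-token arrival timestamps. It is a minimal example under stated assumptions, not the method of any tool mentioned here: `RequestTrace`, its fields, and the helper functions are hypothetical names, ITL is taken as the mean gap between consecutive tokens (one common definition), and percentiles use the nearest-rank method. It assumes every trace contains at least one token.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (times in seconds)."""
    sent_at: float            # when the request was sent
    token_times: list[float]  # arrival time of each streamed token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def e2e_latency(trace: RequestTrace) -> float:
    """End-to-end latency: last token arrival minus send time."""
    return trace.token_times[-1] - trace.sent_at


def mean_itl(trace: RequestTrace) -> float:
    """Mean inter-token latency: average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps) if gaps else 0.0


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up timestamps for illustration only, not measured results.
    traces = [
        RequestTrace(sent_at=0.0, token_times=[0.5, 0.6, 0.7, 0.9]),
        RequestTrace(sent_at=1.0, token_times=[1.2, 1.3, 1.4]),
    ]
    latencies = [e2e_latency(t) for t in traces]
    print("TTFT per request:", [round(ttft(t), 3) for t in traces])
    print("mean ITL per request:", [round(mean_itl(t), 3) for t in traces])
    print("p99 end-to-end latency:", round(percentile(latencies, 99), 3))
```

Reporting a tail percentile such as p99 alongside the mean matters under concurrent load, because a small fraction of slow requests can dominate user-visible latency while barely moving the average.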