Setting up vLLM for maximum performance

{"{"output{“output”:{“output”: "{“output”: "The{“output”: "The v{“output”: "The vLL{“output”: "The vLLM{“output”: "The vLLM Engine{“output”: "The vLLM Engine parameters{“output”: "The vLLM Engine parameters we{“output”: "The vLLM Engine parameters we will{“output”: "The vLLM Engine parameters we will discuss{“output”: "The vLLM Engine parameters we will discuss:{“output”: "The vLLM Engine parameters we will discuss:\n{“output”: "The vLLM Engine parameters we will discuss:\n\n{“output”: "The vLLM Engine parameters we will discuss:\n\n-{“output”: "The vLLM Engine parameters we will discuss:\n\n- max{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-b{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-t{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n-{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n-{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-util{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n-{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-c{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n-{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- 
enable-prefix-caching\n- enable-ch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunk{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-p{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-pref{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n-{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-e{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n##{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- 
max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR**{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens 
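Before going through each flag individually, here is a minimal sketch of how these parameters can be passed when constructing an engine through vLLM's Python `LLM` class (the hyphenated CLI flags map to the same names with underscores). The model name and every value shown are illustrative assumptions for this sketch, not tuned recommendations; the sections below discuss how to pick them.

```python
from vllm import LLM

# Illustrative values only; each parameter is discussed in detail below.
# The hyphenated CLI flags (e.g. --max-model-len) correspond to the
# underscored keyword arguments used here.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed example model
    max_model_len=4096,              # cap on input + output tokens per request
    max_num_batched_tokens=8192,     # tokens the engine may process per step
    gpu_memory_utilization=0.90,     # fraction of GPU memory vLLM may claim
    enable_prefix_caching=True,      # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,     # split long prefills into smaller chunks
    enforce_eager=False,             # False keeps CUDA graph capture enabled
)

# Quick smoke test with default sampling parameters.
print(llm.generate(["Hello, world!"])[0].outputs[0].text)
```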
## max-model-len:

**TL;DR** Set according to the maximum number of tokens used (input + output).

* By default, the maximum model length equals the maximum context length of your model. For example, for the Llama 3 8B Instruct model, it will be 8192.
* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.
* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-b{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-t{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-util{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, {“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 20{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 204{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
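To make this concrete, here is a minimal sketch (not from the original article) of capping the context window with vLLM's offline `LLM` API; the model name and the 2048-token limit are placeholder assumptions to adapt to your own deployment. When serving over HTTP, the equivalent flag is `--max-model-len`.

```python
# Minimal sketch, assuming the offline vLLM API and a placeholder model/limit.
# Equivalent when serving:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --max-model-len 2048
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_model_len=2048,  # hard cap on input + output tokens per request
)

# Keep max_tokens small enough that prompt + completion stay under the 2048 cap.
params = SamplingParams(max_tokens=256)
outputs = llm.generate(["Summarize what max-model-len controls."], params)
print(outputs[0].outputs[0].text)
```

Requests whose prompt plus requested completion would exceed the cap are rejected up front, which is usually preferable to discovering the limit mid-generation.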
## max-num-batched-tokens:

**TL;DR** Start
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it:{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
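To make the knob concrete, here is a minimal sketch (not from the original article); the model name and the 2048-token budget are assumptions you should replace with your own. The offline engine takes the same parameter in snake_case, and the OpenAI-compatible server accepts it as --max-model-len.

```python
from vllm import LLM, SamplingParams

# Assumption: each request (prompt + completion) stays within a 2048-token budget.
# Capping max_model_len below the model's native 8192-token context avoids
# memory-shortage errors at load time and keeps the other limits discussed here
# (batched tokens, KV-cache sizing) aligned with what the workload actually uses.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # gated model; requires HF access and a GPU
    max_model_len=2048,
)

sampling = SamplingParams(max_tokens=256)
outputs = llm.generate(["Explain why capping context length saves GPU memory."], sampling)
print(outputs[0].outputs[0].text)

# Server equivalent (roughly):
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --max-model-len 2048
```

If a request ever exceeds the cap, vLLM will typically reject it rather than silently truncate it, so leave some headroom over your longest expected prompt plus output.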
## max-num-batched-tokens:

**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or an [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-ch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunk{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you use enable-chunked{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-p{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-pref{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
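To make this concrete, here is a minimal offline-inference sketch; the model name, token budgets, and prompt are illustrative assumptions rather than values from this article. It caps max-model-len at 2048 instead of the model’s default 8192:

```python
# Minimal sketch: cap the context window (input + output tokens) at 2048
# instead of the model's default 8192. Model name and values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,  # corresponds to the --max-model-len engine argument
)

# The output budget counts toward max_model_len together with the prompt.
sampling = SamplingParams(max_tokens=256)
outputs = llm.generate(["Summarize paged attention in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```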
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try multiples of it: max-model-len * 2, then * 3, and so on, until you encounter a memory shortage error or an [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See the explanations below.
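As a rough illustration of that search (not the article’s own procedure; the model name, prompt batch, and candidate values are assumptions), one might benchmark a few multiples of max-model-len like this. In practice you would restart the process between runs, since each engine instance holds GPU memory:

```python
# Rough tuning sketch: benchmark max-num-batched-tokens at 1x, 2x, 3x of
# max-model-len and keep the fastest setting that neither runs out of
# memory nor logs eviction warnings. All concrete values are illustrative.
import time
from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048
prompts = ["Explain KV-cache paging briefly."] * 32
sampling = SamplingParams(max_tokens=128)

for factor in (1, 2, 3):
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        max_model_len=MAX_MODEL_LEN,
        max_num_batched_tokens=MAX_MODEL_LEN * factor,  # --max-num-batched-tokens
        # enable_chunked_prefill=True,  # if enabled, start much lower, e.g. 128
    )
    start = time.perf_counter()
    llm.generate(prompts, sampling)
    print(f"max_num_batched_tokens={MAX_MODEL_LEN * factor}: "
          f"{time.perf_counter() - start:.1f}s")
```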
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
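To make this concrete, here is a minimal sketch using vLLM's offline Python API; the model name and the 2048-token budget are placeholders for your own use case, and the `max_model_len` argument name follows recent vLLM releases.

```python
from vllm import LLM, SamplingParams

# Cap the context window at what the use case actually needs (input + output),
# rather than the model's default maximum context length.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_model_len=2048,                           # total token budget per request
)

params = SamplingParams(max_tokens=256)  # output tokens count toward the 2048 budget
outputs = llm.generate(["Summarize PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```

When serving over HTTP, the equivalent setting is typically the `--max-model-len` flag of `vllm serve` (or the OpenAI-compatible API server).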
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
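As a concrete illustration, here is a minimal sketch using vLLM's offline LLM API; the model name and the 2048-token limit are illustrative assumptions for a use case that stays around 2K tokens, and the same setting is exposed as the --max-model-len flag when launching a server.

```python
# Minimal sketch: cap the context window instead of inheriting the model default.
# The model name and the 2048-token limit are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,  # counts input + output tokens per request
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of paged attention."], params)
print(outputs[0].outputs[0].text)
```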
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it in multiples: max-model-len * 2, then * 3… until you encounter an out-of-memory error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter denotes a number of tokens, not a batch size.
* Each request must go through a prefill stage (computing the KV values for the input tokens and storing them in KV cache blocks), followed by a decoding stage, in which autoregressive forward passes generate tokens one at a time.
* Thus, the minimum value of this parameter equals the maximum model length you set, so that the prefill of a full-length prompt fits in a single batch (unless chunked prefill is enabled). A tuning sketch follows below.
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunk{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-p{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-pref{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
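For reference, here is a minimal sketch of how the parameter can be passed through vLLM's offline LLM API; the model name, the 2048-token limit, and the prompt are illustrative placeholders, not recommendations:

```python
# Minimal sketch: cap the context window at 2048 tokens (input + output).
# Model name and limit are placeholders; adjust them to your own use case.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
    max_model_len=2048,                           # input + output tokens
)

outputs = llm.generate(
    ["Summarize PagedAttention in one sentence."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

The OpenAI-compatible server accepts the same setting as a command-line flag, e.g. `--max-model-len 2048`.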
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try multiples of it: max-model-len * 2, then * 3, and so on, until you hit a memory-shortage error or an eviction warning, and keep the value that gives the best performance. If you use enable-chunked-prefill, start with a much smaller value, such as 128, and experiment from there. See the explanations below.
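Before those explanations, here is a rough sketch of what such a sweep could look like. The probe script, model name, toy workload, and candidate values are all assumptions for illustration; each candidate is tested in a separate run so the engine starts from a clean GPU:

```python
# probe.py -- toy throughput probe for one candidate max-num-batched-tokens value.
# Run once per candidate, e.g.:  python probe.py 2048  then  python probe.py 4096 ...
# Model name, workload, and the max-model-len of 2048 are illustrative only.
import sys
import time

from vllm import LLM, SamplingParams

candidate = int(sys.argv[1])   # candidate max-num-batched-tokens value
MAX_LEN = 2048                 # the max-model-len chosen in the previous section

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=MAX_LEN,
    max_num_batched_tokens=candidate,  # must be >= MAX_LEN without chunked prefill
)

prompts = ["Explain KV caching in two sentences."] * 64  # toy workload
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
llm.generate(prompts, params)
elapsed = time.perf_counter() - start
print(f"max_num_batched_tokens={candidate}: {elapsed:.1f} s, "
      f"{len(prompts) / elapsed:.2f} req/s")
```

With enable-chunked-prefill turned on, the requirement that this value be at least max-model-len no longer applies, which is why much smaller candidates such as 128 become worth trying.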
* Note that this parameter denotes a number of tokens, not a batch size (i.e., not a number of requests).
* Each request must go through a prefill stage (computing the KV values for the input tokens and distributing them into KV-cache blocks), followed by a decoding stage, in which recursive forward passes generate output tokens one by one.
* Thus, without chunked prefill, the minimum value of this parameter equals the max-model-len you set, so that a maximum-length prompt can still be prefilled in a single step (chunked-prefill is discussed later).
* The larger the batch size in terms of tokens, the more prefill operations occur, meaning
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
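
As a rough illustration, here is how such a cap might be passed to vLLM's offline `LLM` entry point. The model name and the 2048-token budget are placeholders for your own use case, and exact argument names can vary slightly between vLLM versions:

```python
from vllm import LLM, SamplingParams

# Hypothetical budget: prompts up to ~1,500 tokens plus ~500 generated tokens.
# Capping max_model_len at the real use-case limit (instead of the model's
# native 8192) avoids load-time memory surprises and frees KV cache blocks.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=2048,                           # counts input + output tokens
    gpu_memory_utilization=0.90,
)

params = SamplingParams(max_tokens=500)
out = llm.generate(["Summarize vLLM in one sentence."], params)
print(out[0].outputs[0].text)
```

The same cap can be passed to the OpenAI-compatible server via the `--max-model-len` flag.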

## max-num-batched-tokens:

TL;DR Start with max-model-len, then try multiples of it: max-model-len * 2, then * 3, and so on, until you hit a memory shortage error or an eviction warning, and keep the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below.
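
A minimal sketch of that sweep, assuming an offline benchmark script that you re-run once per candidate value (the model name, prompt workload, and throughput metric are stand-ins; measure your own traffic in practice):

```python
import sys
import time

from vllm import LLM, SamplingParams

# Usage: python bench_batched_tokens.py 2048
# Re-run with 2048, 4096, 6144, ... and keep the fastest setting; stop growing
# the value once vLLM reports memory errors or eviction/preemption warnings.
max_batched = int(sys.argv[1])

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=2048,
    max_num_batched_tokens=max_batched,
)

prompts = ["Write a haiku about GPUs."] * 64   # stand-in workload
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"max_num_batched_tokens={max_batched}: {generated / elapsed:.1f} tok/s")
```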

* Note that this parameter denotes a number of tokens, not a batch size.
* Each request must go through a prefill stage (computing the KV values for the input tokens and distributing them into KV cache blocks), followed by a decoding stage, in which recursive forward computation generates tokens one by one.
* Thus, without chunked prefill (discussed later), the minimum value of this parameter is the max-model-len you set, so that a full prompt can be prefilled in a single step.
* The larger the batch is in terms of tokens, the more prefill work can be scheduled at once, so the first token is generated sooner; the vLLM scheduler favors prefill by default.
* Therefore, a larger batch size can enhance throughput,
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
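To make this concrete, here is a minimal sketch using vLLM's offline Python API; the `--max-model-len` flag on the OpenAI-compatible server is the same knob. The model name and the 1536/512 token budgets are placeholder assumptions, not recommendations.

```python
# Minimal sketch: cap the context at what the use case actually needs
# instead of inheriting the model's full default context length.
from vllm import LLM, SamplingParams

MAX_INPUT_TOKENS = 1536   # longest prompt we expect (assumed budget)
MAX_OUTPUT_TOKENS = 512   # longest completion we allow (assumed budget)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    # max-model-len covers input + output tokens, so budget for both.
    max_model_len=MAX_INPUT_TOKENS + MAX_OUTPUT_TOKENS,
)

sampling = SamplingParams(max_tokens=MAX_OUTPUT_TOKENS)
outputs = llm.generate(["Explain KV caching in two sentences."], sampling)
print(outputs[0].outputs[0].text)
```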
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it in multiples: max-model-len * 2, then * 3, and so on, until you encounter a memory shortage error or an eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See the explanations below.
* Note that this parameter denotes a number of tokens, not a batch size in requests.
* Each request must go through a prefill stage (computing the KV values for the input tokens and distributing them into KV cache blocks), followed by a decoding stage, in which a recursive forward pass generates output tokens one by one.
* Thus, unless you use chunked prefill (discussed later), the minimum value of this parameter is the max-model-len you set, because an entire prompt must fit into a single prefill batch.
* The larger the token budget per batch, the more prefill work can be scheduled at once, so the first token is generated sooner; the vLLM scheduler favors prefill by default.
* Therefore, a larger token budget can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill below).
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-pref{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
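To make this concrete, here is a minimal sketch using vLLM's offline `LLM` API (assuming a recent vLLM release, where the Python keyword arguments mirror the CLI flags with dashes replaced by underscores; the model choice and token numbers are illustrative):

```python
from vllm import LLM, SamplingParams

# Example use case: prompts stay under ~1,500 tokens and generation is capped
# at 500 tokens, so 2048 (input + output) is a safe, tight max_model_len.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    max_model_len=2048,
)

params = SamplingParams(max_tokens=500)
out = llm.generate(["Summarize what a KV cache is."], params)
print(out[0].outputs[0].text)
```

The same setting is exposed as the `--max-model-len` flag when launching the OpenAI-compatible server.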
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try multiples of it: max-model-len * 2, then * 3, and so on, until you hit an out-of-memory error or an eviction warning, and keep the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter counts tokens, not requests: it is not a batch size.
* Each request first goes through a prefill stage (computing the KV values for the input tokens and writing them into KV cache blocks), followed by a decode stage, in which repeated forward passes generate output tokens one at a time.
* Thus, the minimum value of this parameter equals the max-model-len you set, so that a full-length prompt can be prefilled in one step (unless you use chunked prefill, discussed later).
* The larger the token budget per batch, the more prefill work can be scheduled at once, so the first token is produced sooner. By default, the vLLM scheduler prioritizes prefill.
* Therefore, a larger token budget can improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill operations typically consume significant compute, quickly reaching the maximum token batch size we can sustain.
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
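To make this concrete, a minimal sketch of capping the context length when constructing an offline engine is shown below. It assumes the `vllm.LLM` Python entry point; the model name and values are examples only, and keyword names such as `max_model_len` may vary slightly across vLLM versions (the server equivalent is typically a `--max-model-len` flag).

```python
# Minimal sketch: cap max-model-len for a use case that never exceeds
# ~2048 tokens of input + output. Model name and values are examples only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,           # input + output tokens per request
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)

sampling = SamplingParams(max_tokens=256)
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```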
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it (max-model-len * 2, then * 3, and so on) until you encounter an out-of-memory error or an eviction warning, and keep the value that gives the best performance. If you use enable-chunked-prefill, start instead with a small value such as 128 and experiment from there. See the explanations below.

* Note that this parameter counts tokens, not requests; it is not a batch size in the usual sense.
* Each request goes through a prefill stage (computing the KV values for all input tokens and placing them into KV-cache blocks), followed by a decoding stage, in which autoregressive forward passes generate output tokens one by one.
* Without chunked prefill, the minimum value of this parameter is therefore the max-model-len you set, so that a full-length prompt can be prefilled in a single scheduler step (chunked-prefill is discussed later).
* The larger the token budget per step, the more prefill work the scheduler can pack into it, so the first token is produced sooner. By default, the vLLM scheduler favors prefill over decoding.
* A larger token budget can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so the engine quickly reaches the largest token batch it can actually sustain.
* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter an out-of-memory error or an eviction warning; a concrete sweep is sketched below.
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
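As a concrete illustration, here is a minimal sketch (not from the original article; the model name and token limits are placeholders) of starting the offline vLLM engine with max_model_len capped to the use case's token budget:

```python
# Minimal sketch: cap max_model_len to the tokens this workload actually needs.
# The model name and limits below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,   # prompt + completion budget for this use case
)

# Reserve part of the budget for the completion; the rest is left for the prompt.
params = SamplingParams(max_tokens=512, temperature=0.0)
out = llm.generate(["Summarize what a KV cache stores."], params)
print(out[0].outputs[0].text)
```

When serving over HTTP instead of using the offline engine, the same limit is exposed as the --max-model-len engine argument.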
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it in multiples: max-model-len * 2, then * 3, and so on, until you encounter an out-of-memory error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a much smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter counts tokens, not requests; it is not a batch size in the usual sense.
* Each request goes through a prefill stage (computing the KV values for the input tokens and writing them into KV-cache blocks), followed by a decoding stage, in which the forward pass runs recursively to generate tokens one at a time.
* Thus, unless you use chunked prefill (discussed later), the minimum value of this parameter is the max-model-len you set, so that the prefill of a full-length request fits in a single batch.
* The larger the token budget per batch, the more prefill work can be scheduled at once, so the first token is generated sooner; the vLLM scheduler favors prefill by default.
* Therefore, a larger token budget can improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so it quickly uses up whatever token budget the hardware can sustain.
* Hence, it is better to start with max-model-len, keep doubling it, and observe how performance changes until you encounter an out-of-memory error or an eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models). A short tuning sketch follows this list.
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/per{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
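For concreteness, here is a minimal sketch (not from the original article) of how max-model-len might be set, both via the CLI and the offline LLM API; the model name and the 2048-token budget are placeholder assumptions standing in for whatever your own use case requires.

```python
# Illustrative CLI launch (placeholder model and token budget):
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --max-model-len 2048

from vllm import LLM, SamplingParams

# Cap the context (prompt + generated tokens) at what the use case needs,
# instead of the model's full 8192-token default.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_model_len=2048,
)

# max_tokens counts against the same budget: prompt length + max_tokens
# must stay within max_model_len.
params = SamplingParams(max_tokens=256)
outputs = llm.generate(["Summarize what max-model-len controls."], params)
print(outputs[0].outputs[0].text)
```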
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter denotes a number of tokens, not a batch size in requests.
* Each request must go through a prefill stage (computing the KV values for the input tokens and distributing them into KV cache blocks), followed by a decoding stage, in which a recursive forward pass generates output tokens one by one.
* Thus, unless you use chunked-prefill (discussed later), the minimum value of this parameter equals the max-model-len you set, so that the prefill of a full-length request fits into a single batch.
* The larger the batch size in terms of tokens, the more prefill work can be scheduled at once, meaning the first token is generated sooner. The vLLM scheduler favors prefill by default.
* Therefore, a larger token batch can enhance throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-intensive, so you quickly reach the largest token batch the hardware can sustain.
* Hence, start with max-model-len, keep doubling it, and observe how performance changes until you encounter a memory shortage error or an eviction warning, as described in the TL;DR above, and select the optimal token batch count; a sketch of such a sweep follows this list. Apply the same principle to build an understanding of how your system behaves.
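The doubling experiment described in the last bullet can be scripted. The sketch below is a simplified illustration under stated assumptions (a toy prompt workload, wall-clock timing, and the offline `LLM` API); a real tuning run should replay representative traffic, and depending on the vLLM version you may need a fresh process per configuration to fully release GPU memory:

```python
# Simplified sketch of the max-num-batched-tokens sweep described above.
# Workload, model name, and multipliers are illustrative assumptions.
import time
from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048
prompts = ["Explain KV caching in two sentences."] * 64   # stand-in workload
params = SamplingParams(max_tokens=128, temperature=0.0)

for multiplier in (1, 2, 3, 4):                            # max-model-len, *2, *3, *4
    batched_tokens = MAX_MODEL_LEN * multiplier
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        max_model_len=MAX_MODEL_LEN,
        max_num_batched_tokens=batched_tokens,
    )
    start = time.perf_counter()
    llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    print(f"max-num-batched-tokens={batched_tokens}: {len(prompts) / elapsed:.2f} req/s")
    del llm   # may not fully free GPU memory; restarting the process per run is safer
```

Stop increasing the value once you hit an out-of-memory error or eviction warnings in the logs, and keep the best-performing setting.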
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-ch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunk{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-p{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
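As a minimal sketch of how this looks in practice (the Llama 3 8B Instruct checkpoint and the 2046-token budget are just the examples used above; substitute your own model and limit):

```python
from vllm import LLM, SamplingParams

# Cap the context window to what the use case actually needs (input + output),
# instead of the model's default 8192-token maximum.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_model_len=2046,                           # example budget from the bullets above
)

out = llm.generate(
    ["Summarize what max-model-len controls in one sentence."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```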
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it (max-model-len * 2, then * 3, and so on) until you hit a memory-shortage error or eviction warnings, and pick the value that gives the best performance. If you use enable-chunked-prefill, start with a much smaller value, such as 128, and experiment from there. See the explanation below and the benchmarking sketch after this list.

* Note that this parameter counts tokens, not requests: it is not the batch size.
* Every request goes through a prefill stage (computing the KV values for the input tokens and distributing them into KV-cache blocks), followed by a decoding stage in which autoregressive forward passes generate tokens one by one.
* Consequently, the minimum value of this parameter is the max-model-len you set, because the prefill of a single request must fit in one token batch (unless you use chunked-prefill, discussed later).
* The larger the token batch, the more prefill work can be scheduled at once, so the first token of each request is produced sooner; the vLLM scheduler favors prefill by default.
* A larger token batch can therefore raise throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so the largest token batch the hardware can sustain is reached quickly.
* Hence, start with max-model-len, keep increasing it, and observe how performance changes until you hit a memory-shortage error or eviction warnings; pick the token count that performs best, and use the same approach to understand how your system behaves.
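A rough benchmarking sketch of the procedure above, run once per candidate value so each trial starts with a clean GPU. The model name, prompts, and 2046-token budget are assumptions for illustration; replace them with your own workload.

```python
"""Benchmark a single candidate value of max_num_batched_tokens.

Run once per value (e.g. 2046, 4092, 6138, ...) and watch the vLLM logs
for out-of-memory errors or eviction warnings between runs.
"""
import sys
import time

from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2046  # assumed budget from the max-model-len section
candidate = int(sys.argv[1]) if len(sys.argv) > 1 else MAX_MODEL_LEN

# Hypothetical workload: use prompts representative of your production traffic.
prompts = ["Explain the KV cache in two sentences."] * 64

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_model_len=MAX_MODEL_LEN,
    max_num_batched_tokens=candidate,
)

start = time.perf_counter()
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"max_num_batched_tokens={candidate}: {generated / elapsed:.0f} generated tokens/s")
```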
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values for reuse, which can save considerable time.
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
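As an illustration, here is a minimal sketch of how this setting might look with vLLM's offline Python API; the model name and the 2048-token cap are placeholder assumptions for a short-context use case, and the same value maps to the --max-model-len flag when serving.

```python
# Minimal sketch, assuming a short-context use case (<= 2048 input + output tokens).
# Model name and numbers are placeholders, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,  # default would be the model's full 8192-token context
)

# The output budget counts toward max_model_len together with the prompt tokens.
sampling = SamplingParams(max_tokens=256, temperature=0.0)
result = llm.generate(["Explain paged attention in two sentences."], sampling)
print(result[0].outputs[0].text)
```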
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it in multiples: max-model-len * 2, then * 3, and so on, until you hit a memory shortage error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter denotes a number of tokens, not a batch size in requests.
* Each request first goes through a prefill stage (computing the KV values for the input tokens and distributing them into KV cache blocks), followed by a decoding stage, in which recursive forward passes generate tokens one by one.
* Thus, unless you use chunked prefill (discussed later), the minimum value of this parameter is the max-model-len you set, because a full prompt must fit into a single prefill batch.
* The larger the token batch, the more prefill work can be scheduled at once, so the first token of each request is generated sooner; the vLLM scheduler favors prefill by default.
* A larger token batch can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill operations consume significant compute, so you quickly reach the largest token batch the GPU can sustain.
* Hence, start with max-model-len, increase it step by step, and observe how performance changes until you encounter a memory shortage error or an eviction warning, then pick the token batch size that performs best. Use this procedure to build intuition for how your system behaves (a rough benchmarking sketch follows this list).
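The tuning procedure above can be scripted. The sketch below is a rough benchmarking loop under assumed settings (placeholder model, synthetic prompts, multipliers 1 to 3); in practice it is more reliable to launch each configuration in a fresh process, or to run the vLLM server with each setting and benchmark it externally.

```python
# Rough sketch of the sweep described above; all values are illustrative.
import time
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"             # placeholder
MAX_MODEL_LEN = 2048                                      # from the previous section
PROMPTS = ["Summarize the benefits of KV caching."] * 64  # synthetic load
SAMPLING = SamplingParams(max_tokens=128)

for multiplier in (1, 2, 3):  # stop early on OOM errors or eviction warnings
    llm = LLM(
        model=MODEL,
        max_model_len=MAX_MODEL_LEN,
        max_num_batched_tokens=MAX_MODEL_LEN * multiplier,
    )
    start = time.perf_counter()
    outputs = llm.generate(PROMPTS, SAMPLING)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"max_num_batched_tokens={MAX_MODEL_LEN * multiplier}: "
          f"{total_tokens / elapsed:.0f} output tokens/s")
    del llm  # GPU memory is not always fully released; a fresh process per run is safer
```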
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values for reuse, which can save considerable time.
* If most of your request stays the same across different inputs (for example, a long shared system prompt), you will notice a more significant performance boost.
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
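As a minimal sketch of what this looks like in practice (the model name, limits, and prompt below are illustrative assumptions, not recommendations):

```python
# Sketch: cap max_model_len for a use case that never exceeds ~2048 tokens.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # default context length 8192
    max_model_len=2048,           # input + output tokens per request
    gpu_memory_utilization=0.90,  # see the gpu-memory-utilization section
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize vLLM in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same limit can be passed to the OpenAI-compatible server as `--max-model-len 2048` when launching it.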
## max-num-batched-tokens:

TL;DR Start with max-model-len, then increase it stepwise: max-model-len * 2, then * 3, and so on, until you hit an out-of-memory error or an eviction warning, and keep the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter is a number of tokens, not a number of requests per batch.
* Each request goes through a prefill stage (computing the KV values for the input tokens and distributing them into KV-cache blocks), followed by a decoding stage, in which autoregressive forward passes generate output tokens one by one.
* Consequently, the minimum value of this parameter equals the max-model-len you set, so that the prefill of a single request fits in one scheduling step (unless you use chunked prefill, discussed later).
* The larger the token budget per batch, the more prefill work can be scheduled at once, so the first token is generated sooner; the vLLM scheduler favors prefill by default.
* A larger token budget can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-intensive, so the largest token batch the hardware can sustain is reached quickly.
* Hence, start with max-model-len, keep doubling it, and observe how performance changes until you encounter an out-of-memory error or an eviction warning; pick the token budget that performs best, and use the same experiment to learn how your system behaves.
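A rough illustration of that tuning loop is sketched below; the model, prompt set, candidate values, and timing method are all assumptions for illustration, not a benchmark harness. Run the script once per candidate value so each engine starts from a clean GPU.

```python
# Sketch: time one max_num_batched_tokens setting; invoke once per candidate
# value (max-model-len, *2, *3, ...) and compare the printed throughput.
import sys
import time
from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048
budget = int(sys.argv[1]) if len(sys.argv) > 1 else MAX_MODEL_LEN

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=MAX_MODEL_LEN,
    max_num_batched_tokens=budget,  # the value under test
)

prompts = ["Explain KV caching in two sentences."] * 64
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
llm.generate(prompts, params)
elapsed = time.perf_counter() - start
print(f"max_num_batched_tokens={budget}: {len(prompts) / elapsed:.2f} req/s")
```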
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values for reuse, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), you will see a much larger performance boost than when only a small portion of the request’s tokens is shared.
* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of writing); see the example after this list.
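A small sketch of the shared-prefix case, again with an illustrative model and prompts; the long system prompt is the part whose KV values get reused across requests:

```python
from vllm import LLM, SamplingParams

# With enable_prefix_caching=True the KV values of the shared prefix (here, the
# long system prompt) are computed once and reused by every request that starts
# with the same tokens, which mainly shortens the prefill stage.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model choice
    max_model_len=2048,
    enable_prefix_caching=True,
)

system_prompt = (
    "You are a support assistant for an online store. Answer strictly according "
    "to the following policies: ..."  # imagine a long, fixed block of policy text
)
questions = ["How do I reset my password?", "What is the refund window?"]
prompts = [f"{system_prompt}\n\nUser: {q}\nAssistant:" for q in questions]

for out in llm.generate(prompts, SamplingParams(max_tokens=128)):
    print(out.outputs[0].text)
```

The server-side equivalent is the --enable-prefix-caching flag.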
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
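For illustration only, here is a minimal sketch of starting the offline engine with an explicit max-model-len. The model name, prompt, and the 2048-token limit are placeholders for your own use case, and on the OpenAI-compatible server the same setting is passed as the --max-model-len flag.

```python
# Minimal sketch (not the post's code): offline engine with an explicit max-model-len.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model; use your own
    max_model_len=2048,                           # input + output tokens for your use case
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize what a KV cache is in one sentence."], params)
print(outputs[0].outputs[0].text.strip())
```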
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it: max-model-len * 2, then * 3, and so on, until you encounter an out-of-memory error or an eviction warning. Keep the value that gives the best performance. If you use enable-chunked-prefill, start with a much smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter denotes a number of tokens, not a batch size in requests.
* Each request goes through a prefill stage (computing the KV values for the input tokens and writing them into KV-cache blocks), followed by a decoding stage, in which repeated forward passes generate output tokens one at a time.
* Consequently, the minimum value of this parameter is the max-model-len you set, so that a full prompt can be prefilled in one step (unless you use chunked prefill, discussed later).
* The larger the token budget per batch, the more prefill work can be scheduled at once, so the first token is produced sooner; the vLLM scheduler favors prefill by default.
* A larger token budget can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so the GPU often saturates well before a very large token budget is actually used.
* Hence, start with max-model-len, keep increasing it, and observe how performance changes until you hit an out-of-memory error or an eviction warning, then pick the token budget that performed best. Use the same measurements to understand how your system behaves under load; the sweep sketched below automates this.
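As a hedged illustration of that sweep (not the post's own code), the script below benchmarks one candidate value per run; the model name, synthetic prompts, and the tokens-per-second metric are assumptions, and running each trial in a fresh process keeps GPU memory clean between configurations.

```python
# Hedged sketch of the sweep described above. Run once per candidate value, e.g.:
#   python sweep.py 2048 && python sweep.py 4096 && python sweep.py 6144
import sys
import time

from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048                                       # from the max-model-len section
batched_tokens = int(sys.argv[1]) if len(sys.argv) > 1 else MAX_MODEL_LEN

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",           # example model
    max_model_len=MAX_MODEL_LEN,
    max_num_batched_tokens=batched_tokens,
)

prompts = ["Explain KV caching in two sentences."] * 64    # synthetic workload
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"max_num_batched_tokens={batched_tokens}: {generated / elapsed:.1f} generated tokens/s")
```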
## enable-prefix-caching:

TL;DR Enable it, unless you are using enable-chunked-prefill.

* It caches computed KV values for reuse, which can save considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), you will see a much larger speedup than when only a small portion of the tokens is shared; the sketch below illustrates this setup.
* If enable-chunked-prefill is used, this parameter cannot be used with it (as of the time of this post's publication).
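Here is a hedged sketch of that shared-prefix setup; the system prompt and questions are illustrative, not from the original post.

```python
# Sketch: prefix caching with a long prefix shared across requests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    enable_prefix_caching=True,
)

# A long, identical prefix: its KV values are computed once and then reused.
shared_prefix = "You are a support assistant for ACME Corp. Answer briefly.\n" * 40
questions = ["How do I reset my password?", "Where can I download my invoices?"]

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate([shared_prefix + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text.strip())
```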
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values so they can be reused across requests, which saves considerable prefill time.
* The more of each request that is identical across requests, for example a long shared system prompt or fixed few-shot examples, the larger the speedup; if only a small portion of the tokens repeats, the gain is modest.
* If enable-chunked-prefill is used, this parameter is not supported (as of the time of this post’s publication). A small usage sketch follows this list.
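A minimal sketch of the case where prefix caching pays off, namely many requests sharing one long, identical prefix. The model name and prompts are illustrative.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)

# Long, identical prefix shared by every request (system prompt, policies, ...).
shared_prefix = (
    "You are a support assistant for ACME Corp. Answer politely, cite the "
    "relevant policy section, and never reveal internal tooling details.\n\n"
)
questions = ["How do I reset my password?", "Where can I download my invoice?"]

# KV cache blocks for shared_prefix are computed for the first request and
# reused by the second, so its prefill only covers the question text.
outputs = llm.generate(
    [shared_prefix + q for q in questions],
    SamplingParams(max_tokens=128),
)
for o in outputs:
    print(o.outputs[0].text)
```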
## gpu-memory-utilization:

TL;DR The higher, the better, unless a memory shortage occurs. The default is 0.9.
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
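To make this concrete, here is a minimal, hedged sketch of starting an offline vLLM engine with an explicit max_model_len. The model name and the 2048-token cap are assumptions for illustration, and the argument names follow vLLM's engine arguments at the time of writing, so they may differ between versions.

```python
# Minimal sketch (assumed model name and values; argument names may vary by vLLM version).
from vllm import LLM, SamplingParams

# Cap the context at 2048 tokens (input + output) instead of the model's
# default maximum of 8192.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,
)

# The output budget must fit inside the same 2048-token cap together with the prompt.
params = SamplingParams(max_tokens=256)
outputs = llm.generate(["Summarize what a KV cache is in one sentence."], params)
print(outputs[0].outputs[0].text)
```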
## max-num-batched-tokens:

TL;DR Start with max-model-len, then increase it in multiples: max-model-len * 2, then * 3, and so on, until you hit an out-of-memory error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter counts tokens, not requests; it is not a batch size in the usual sense.
* Each request first goes through a prefill stage (computing the KV values for its input tokens and distributing them into KV cache blocks), followed by a decode stage, in which recursive forward passes generate output tokens one at a time.
* Thus, the minimum sensible value of this parameter is the max-model-len you set, so that the longest allowed prompt can be prefilled in one pass (unless you use chunked prefill, discussed later).
* The larger the token batch, the more prefill work can be scheduled at once, so the first token is produced sooner; the vLLM scheduler favors prefill by default.
* A larger token batch can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-intensive, so the compute budget is often exhausted well before the token batch limit is reached.
* Hence, start with max-model-len, increase it step by step, and observe how performance changes until you hit an out-of-memory error or an eviction warning, then pick the token batch size that performs best; a tuning sketch follows this list. Use the same approach to build intuition about how your own system behaves.
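To illustrate the tuning loop described above, here is a rough, hedged sketch that sweeps max_num_batched_tokens over multiples of max-model-len and times a fixed workload. The model, prompts, and multipliers are assumptions, and in practice each configuration is better run in a separate process while you watch the logs for out-of-memory errors or eviction warnings.

```python
# Rough tuning sketch (assumed model, prompts, and multipliers; not a benchmark harness).
import time

from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048
prompts = ["Explain how the vLLM scheduler batches tokens."] * 64  # assumed fixed workload
params = SamplingParams(max_tokens=128)

for multiplier in (1, 2, 3, 4):  # max-model-len * 1, * 2, * 3, ...
    batched_tokens = MAX_MODEL_LEN * multiplier
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        max_model_len=MAX_MODEL_LEN,
        max_num_batched_tokens=batched_tokens,
    )
    start = time.perf_counter()
    llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    print(f"max_num_batched_tokens={batched_tokens}: {len(prompts) / elapsed:.2f} req/s")
    del llm  # releasing GPU memory in-process is best-effort; separate runs are more reliable
```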
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values for reuse across requests, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt or document), you will see a much larger performance boost than when only a small portion of the tokens repeats (see the workload sketch after this list).
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication).
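As an illustration of where prefix caching pays off, here is a hedged sketch of a workload in which every request shares one long, identical prefix, so its KV values can be computed once and reused. The model name, the prefix, and the questions are assumptions, and enable_prefix_caching follows vLLM's engine arguments at the time of writing.

```python
# Sketch of a prefix-caching-friendly workload (assumed model, prefix, and questions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)

# One long prefix shared by every request, e.g. a policy document or system prompt.
shared_prefix = "You are a support assistant. Policy document:\n" + ("... policy text ..." * 200)
questions = [
    "How do I reset my password?",
    "What is the refund window?",
    "Who do I contact about billing?",
]

params = SamplingParams(max_tokens=128)
outputs = llm.generate([f"{shared_prefix}\n\nQuestion: {q}" for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```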
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values so they can be reused across requests, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), you will see a much larger performance boost than when only a small portion of the tokens is shared; the sketch after this list illustrates that setup.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication).
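A minimal sketch of the scenario where prefix caching pays off, assuming a long, fixed system prompt shared by every request; the prompt text and model name are made up for illustration.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
)

# A long, identical prefix shared by all requests; only the tail varies.
system_prompt = "You are a support assistant for ACME Corp. " * 50
questions = ["How do I reset my password?", "Where can I download invoices?"]
prompts = [system_prompt + q for q in questions]

params = SamplingParams(temperature=0.0, max_tokens=64)
# The KV blocks for the shared prefix are computed once and reused,
# so later requests skip most of their prefill work.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text.strip())
```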
## gpu-memory-utilization:

TL;DR The higher, the better, unless a memory shortage occurs. The default is 0.9.

* The more GPU memory is available for storing the KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors or if the GPU is shared with other processes. A short configuration sketch follows below.
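A small sketch of how you might nudge this knob, with 0.95 as an assumed starting point on a GPU dedicated to vLLM and a lower value when the card is shared; the model name is again a placeholder.

```python
from vllm import LLM

# On a dedicated GPU, push the fraction above the 0.9 default to leave more room
# for the KV cache; lower it if the engine fails to load with an out-of-memory
# error or if other processes need part of the card.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_model_len=2048,
    gpu_memory_utilization=0.95,
)
```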
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the Llama 3 8B Instruct model, it will be 8192.
* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.
* This not only prevents memory shortage errors during model loading or usage, but also helps you configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.
* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly (see the sketch below).
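To make this concrete, here is a minimal sketch using vLLM’s offline Python API (the same setting is exposed as --max-model-len when launching the server). The model name and the 2048-token cap are placeholders for whatever your own use case needs.

```python
from vllm import LLM, SamplingParams

# Assumption for this sketch: prompts plus completions never exceed ~2048 tokens,
# so we cap max_model_len instead of inheriting the model's full 8192-token window.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_model_len=2048,
)

params = SamplingParams(max_tokens=256, temperature=0.0)
out = llm.generate(["Summarize what max-model-len controls."], params)
print(out[0].outputs[0].text)
```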
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See the explanations below.

* Note that this parameter denotes a number of tokens, not a batch size in requests.
* Each request must go through a prefill stage (computing KV values for the input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.
* Thus, the minimum value of this parameter equals the maximum model length you set, so that the prefill of a full-length request fits in a single batch (unless you use chunked prefill, discussed later).
* The larger the token batch, the more prefill work can be scheduled at once, meaning the first token of each request is generated sooner. The vLLM scheduler favors prefill by default.
* Therefore, a larger token batch can enhance throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so you quickly hit the largest token batch the GPU can actually sustain.
* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or an eviction warning, then select the token batch count that performs best. Keep applying the same measure-and-adjust approach to understand how your system behaves; a sketch of this tuning loop follows the list.
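Here is one way that tuning loop might look with the offline Python API. The model name, prompt set, and candidate values are all placeholders, and each candidate is best measured in a fresh process so GPU memory held by a previous engine does not skew the next run.

```python
import time

from vllm import LLM, SamplingParams


def measure(batched_tokens: int, max_model_len: int = 2048) -> float:
    """Time one batch of generations for a candidate max-num-batched-tokens value.

    Call this once per process for each candidate (1x, 2x, 3x max-model-len, ...)
    and keep the value with the best throughput before errors or evictions appear.
    """
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
        max_model_len=max_model_len,
        max_num_batched_tokens=batched_tokens,
    )
    prompts = ["Summarize vLLM in one sentence."] * 64  # substitute a representative workload
    params = SamplingParams(max_tokens=128)

    start = time.perf_counter()
    llm.generate(prompts, params)
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = measure(batched_tokens=4096)  # e.g. max-model-len * 2
    print(f"max_num_batched_tokens=4096 took {elapsed:.1f}s")
```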
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values for reuse, saving considerable time.
* If most of each request is identical across requests (for example, a long shared system prompt or few-shot examples), you will see a much larger boost than when only a small portion of the request’s tokens is shared.
* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication). A short example follows.
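As a rough illustration of the shared-prefix pattern this flag rewards, here is a small sketch; the system prompt, questions, and model name are made up, and the option is spelled enable_prefix_caching in the Python API.

```python
from vllm import LLM, SamplingParams

# Prefix caching pays off when many requests share a long, identical prefix.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
)

SHARED_PREFIX = "You are a support assistant for ExampleCorp. Policy: ...\n\n"  # identical across requests
questions = ["How do I reset my password?", "How do I close my account?"]

params = SamplingParams(max_tokens=64)
# KV cache blocks for SHARED_PREFIX are computed once and reused by later requests.
outputs = llm.generate([SHARED_PREFIX + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```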
## gpu-memory-utilization:

TL;DR The higher, the better, unless memory shortage occurs. The default is 0.9.

* The more GPU memory allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors, or if the GPU is also running another task that needs a share of the memory (see the sketch below).
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
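Concretely, this is just an engine argument: `--max-model-len` on the server command line, or `max_model_len` when constructing the offline engine. Below is a minimal sketch using the offline `LLM` API; the model name and the 2048-token limit are only examples, so substitute your own.

```python
from vllm import LLM, SamplingParams

# Example values: cap each request (prompt + completion) at 2048 tokens.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,
)

params = SamplingParams(max_tokens=256)
out = llm.generate(["Summarize what a KV cache is."], params)
print(out[0].outputs[0].text)
```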
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3, and so on, until you hit a memory shortage error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below and the sketch after this list.

* Note that this parameter denotes a number of tokens, not a batch size in requests.
* Each request must go through a prefill stage (computing the KV values of the input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward passes generate tokens one by one.
* Thus, the minimum value of this parameter equals the max-model-len you set, because the prefill of a full-length request must fit into a single token batch (unless you use chunked prefill, discussed later).
* The larger the token batch, the more prefill work can be scheduled at once, so the first token is produced sooner; the vLLM scheduler favors prefill by default.
* Therefore, a larger token batch can improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so you quickly reach the largest token batch the hardware can sustain.
* Hence, it is better to start with max-model-len, keep doubling it, and observe how performance changes until you encounter a memory shortage error or an eviction warning, then pick the token batch size that performs best. Along the way you will also get a feel for how your system behaves.
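Here is one way to run that sweep with the offline `LLM` API. This is only a sketch: the model name, prompt set, and candidate values are placeholders, and each candidate is best measured in a fresh process so GPU memory is fully released between runs.

```python
import sys
import time
from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048

# Run this script once per candidate value, e.g.:
#   python sweep.py 2048
#   python sweep.py 4096
#   python sweep.py 6144
value = int(sys.argv[1]) if len(sys.argv) > 1 else MAX_MODEL_LEN

# A stand-in workload; use prompts representative of your real traffic.
prompts = ["Explain what a KV cache block is, in two sentences."] * 64
params = SamplingParams(max_tokens=128)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=MAX_MODEL_LEN,
    max_num_batched_tokens=value,
)

start = time.perf_counter()
llm.generate(prompts, params)
print(f"max_num_batched_tokens={value}: {time.perf_counter() - start:.1f}s")
```

Watch the engine logs while each run executes; memory shortage errors or eviction warnings, as described above, tell you the value is too high for your GPU.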
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, which can save considerable time (see the sketch below).
* The boost is largest when most of every request is identical across inputs (for example, a long shared system prompt); if only a small portion of the tokens is shared, the gain is much smaller.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time this post was published).
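A minimal sketch of what that looks like with the offline API, assuming a long shared system prompt so the cached prefix actually gets reused; the model name and prompts are placeholders.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)

# A long prefix shared by every request is the best case for prefix caching.
shared_prefix = "You are a support assistant for Example Corp. Policies: ..."
questions = ["How do I reset my password?", "How do I cancel my order?"]
params = SamplingParams(max_tokens=128)

# After the first request, the KV blocks of the shared prefix are reused.
outputs = llm.generate([f"{shared_prefix}\n\n{q}" for q in questions], params)
for o in outputs:
    print(o.outputs[0].text)
```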
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9

* The more GPU memory allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors, or if the GPU is also running another task that needs part of the memory (see the example below).
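It is a single fraction passed to the engine. A minimal sketch with example values: 0.95 is reasonable when the GPU is dedicated to vLLM, while a shared GPU usually needs something lower.

```python
from vllm import LLM

# Example: let vLLM use up to 95% of GPU memory for weights, activations,
# and KV cache. Lower this if you see out-of-memory errors or the GPU is
# shared with another process.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.95,
)
```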
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
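To make this concrete, here is a minimal offline-engine sketch that caps max-model-len instead of inheriting the model's full context window. The model name and the 2048-token cap are assumptions for the example, not recommendations:

```python
# Minimal sketch: cap max_model_len to what the use case needs.
# Model name and the 2048-token cap are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # defaults to an 8192-token context
    max_model_len=2048,                           # input + output tokens for this use case
)

params = SamplingParams(temperature=0.0, max_tokens=256)
out = llm.generate(["Summarize what a KV cache is in one sentence."], params)
print(out[0].outputs[0].text)
```

The same cap can be passed to the OpenAI-compatible server with the --max-model-len flag.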

## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it: max-model-len * 2, then * 3… until you hit an out-of-memory error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanation below.

* Note that this parameter denotes a number of tokens, not a batch size in requests.
* Each request goes through a prefill stage (computing KV values for the input tokens and distributing them into KV cache blocks), followed by a decoding stage, in which repeated forward passes generate output tokens one at a time.
* The minimum value of this parameter is therefore the max-model-len you set, since an entire prompt must fit into a single prefill step (unless you use chunked prefill, discussed later).
* The larger the token budget per batch, the more prefill work can be scheduled at once, so the first token is produced sooner; the vLLM scheduler favors prefill by default.
* A larger token budget can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so the GPU tends to saturate well before the largest token budgets pay off.
* Hence, start with max-model-len, keep increasing it, and observe how performance changes until you encounter an out-of-memory error or an eviction warning, then pick the token count that performs best. Use the same experiment to build intuition for how your system behaves; a benchmarking sketch follows this list.
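Here is a rough benchmarking sketch of that tuning loop. The model, prompts, and multipliers are placeholders, and in practice you would usually restart the process (or the server) between configurations so GPU memory is fully released:

```python
# Rough sweep over max_num_batched_tokens: start at max_model_len and keep
# increasing the budget, timing a fixed workload each round, until you hit an
# out-of-memory error or eviction warnings. All values are placeholders.
import time
from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048
prompts = ["Explain prefill versus decoding in two sentences."] * 64  # fixed test load
params = SamplingParams(temperature=0.0, max_tokens=128)

for multiplier in (1, 2, 3, 4):
    budget = MAX_MODEL_LEN * multiplier
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        max_model_len=MAX_MODEL_LEN,
        max_num_batched_tokens=budget,
    )
    start = time.perf_counter()
    llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    print(f"max_num_batched_tokens={budget}: {len(prompts) / elapsed:.2f} req/s")
    del llm  # best-effort release before the next configuration
```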

## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, which saves considerable time.
* If most of the prompt is identical across requests (for example, a long shared system prompt with only a small variable suffix), the performance boost is much larger than when only a small portion of the tokens is shared; the sketch below shows such a setup.
* If the enable-chunked-prefill parameter is used, this parameter has no effect (as of the time of this post's publication).
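A minimal sketch of that setup, with a made-up shared system prompt, might look as follows:

```python
# Minimal prefix-caching sketch: after the first request, the KV blocks of the
# shared system prompt can be reused. The prompt and questions are invented.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)

system_prompt = ("You are a support assistant for an online store. "
                 "Answer briefly and politely. ") * 40  # long, identical across requests
questions = ["How do I reset my password?", "Where can I find my invoices?"]

params = SamplingParams(temperature=0.0, max_tokens=128)
for out in llm.generate([system_prompt + q for q in questions], params):
    print(out.outputs[0].text)
```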

## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is available for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors or if the same GPU also has to run another workload, as in the example below.
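For instance, a sketch that reserves part of the GPU for other processes (the 0.6 value is only an example; the vLLM default is 0.9):

```python
# Leave headroom for other workloads sharing the GPU by lowering
# gpu_memory_utilization below the default of 0.9.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.6,  # reserve the remaining ~40% for other processes
)
```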

## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are created, which may affect performance but reduces GPU memory usage; see the sketch below.
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
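As a concrete illustration, here is a minimal sketch using vLLM's offline Python API, where the hyphenated server flags map to underscored keyword arguments; the model name and the 2048-token cap are example values, not recommendations:

```python
from vllm import LLM, SamplingParams

# Example: prompts plus completions never exceed ~2048 tokens, so we cap
# the model length there instead of using the full 8192-token context.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=2048,                           # input + output tokens
)

params = SamplingParams(max_tokens=256)
outputs = llm.generate(["Summarize this ticket in one sentence: ..."], params)
print(outputs[0].outputs[0].text)
```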
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try multiples of it: max-model-len * 2, then * 3, and so on, until you hit an out-of-memory error or an eviction warning; pick the value that performs best. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanation below.

* Note that this parameter counts tokens, not requests; it is not a batch size in the usual sense.
* Every request first goes through a prefill stage (computing the KV values of the input tokens and placing them into KV cache blocks), followed by a decoding stage in which an autoregressive forward pass generates tokens one by one.
* The minimum value of this parameter is therefore the max-model-len you set, because a full prefill must fit within one token batch (unless you use chunked prefill, discussed later).
* The larger the token batch, the more prefill work can be scheduled at once, so the first token of a request is produced sooner; the vLLM scheduler favors prefill by default.
* A larger token batch can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-intensive, so the GPU is often saturated well before the largest token batch that memory could hold.
* Hence, start with max-model-len, keep increasing the value, and observe how performance changes until you hit an out-of-memory error or an eviction warning; choose the token-batch size that gives the best results, and use the same experiment to understand how your system behaves. A small benchmarking sketch follows this list.
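One way to run that sweep is to time a fixed workload for each candidate value. This is only a rough sketch under stated assumptions: the model name and workload are placeholders, max_num_batched_tokens is assumed to be accepted by the LLM constructor (it is forwarded to the engine arguments in recent vLLM versions), and each candidate is best run in a fresh process so GPU memory is fully released between runs:

```python
import time
from vllm import LLM, SamplingParams

def benchmark(max_num_batched_tokens: int, max_model_len: int = 2048) -> float:
    """Time a fixed workload for one candidate token-batch budget."""
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",    # example model
        max_model_len=max_model_len,
        max_num_batched_tokens=max_num_batched_tokens,  # value under test
    )
    prompts = ["Hello, my name is"] * 64                # placeholder workload
    params = SamplingParams(max_tokens=128)
    start = time.perf_counter()
    llm.generate(prompts, params)
    return time.perf_counter() - start

# Candidates: max-model-len, then 2x, 3x, ...; stop when you hit OOM
# or see eviction/preemption warnings in the engine logs.
print(benchmark(max_num_batched_tokens=2048))
```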
## enable-prefix-caching:

TL;DR Enable it, unless you are using enable-chunked-prefill.

* It caches computed KV values so they can be reused by later requests, which saves considerable time.
* If most of each request is identical across requests (a long shared system prompt or fixed few-shot examples, for instance), the speed-up is much larger than when only a small portion of the tokens is shared; a sketch of such a workload follows this list.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication).
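As an illustration of the workload that benefits most, the sketch below sends the same long system prompt with every request, so its KV cache blocks are computed once and reused. The prompt text and model name are example choices, and the enable_prefix_caching keyword is assumed to mirror the --enable-prefix-caching flag:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    enable_prefix_caching=True,
)

# Long, identical prefix shared by every request: its KV blocks are
# computed for the first request and reused for the following ones.
SYSTEM_PROMPT = "You are a support assistant for ACME Corp. " * 50

questions = ["How do I reset my password?", "Where can I download my invoice?"]
params = SamplingParams(max_tokens=128)
outputs = llm.generate([SYSTEM_PROMPT + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```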
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is set aside for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU also runs another task that needs part of its memory (see the sketch below).
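For example, if another process shares the GPU, the fraction can be lowered from the default; a minimal sketch, where 0.75 and the model name are arbitrary example values:

```python
from vllm import LLM

# Default is 0.9. Lower it if initialization fails with an out-of-memory
# error or another process needs part of the GPU's memory.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    gpu_memory_utilization=0.75,                  # leave ~25% for other work
)
```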
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless memory shortage occurs. The default is 0.9.

* The more GPU memory is allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the same GPU also runs another workload that needs part of its memory.
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
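To make this concrete, here is a minimal sketch (not from the original setup) of passing such a budget to the offline `LLM` API; the model id and the token numbers are placeholder assumptions for illustration only:

```python
from vllm import LLM, SamplingParams

# Hypothetical budget for this use case: prompts up to ~1,500 tokens and
# completions up to ~500 tokens, so 2,048 covers input + output comfortably.
MAX_INPUT_TOKENS = 1500
MAX_OUTPUT_TOKENS = 500
MAX_MODEL_LEN = 2048  # must cover MAX_INPUT_TOKENS + MAX_OUTPUT_TOKENS

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    max_model_len=MAX_MODEL_LEN,                  # instead of the default 8192
)

params = SamplingParams(max_tokens=MAX_OUTPUT_TOKENS, temperature=0.0)
outputs = llm.generate(["Summarize the following text: ..."], params)
print(outputs[0].outputs[0].text)
```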
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try increasing it: max-model-len * 2, then * 3… until you encounter a memory shortage error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a much smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter counts tokens, not requests; it is not the batch size.
* Each request first goes through a prefill stage (computing the KV values for the input tokens and placing them into KV cache blocks), followed by a decoding stage, in which autoregressive forward passes generate output tokens one by one.
* Because prefill processes the whole prompt at once, the minimum value of this parameter is the max-model-len you set (unless you use chunked prefill, discussed later).
* The larger the token budget per batch, the more prefill work can be scheduled in a single step, so the first token of a request is produced sooner. The vLLM scheduler favors prefill by default.
* A larger token budget can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so the GPU often saturates well before the token budget is exhausted.
* Hence, start with max-model-len, keep increasing it, and observe how performance changes until you hit a memory shortage error or an eviction warning; pick the token budget that performs best, and use the same procedure to understand how your system behaves.
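As an illustration of that tuning loop, the sketch below sweeps a few hypothetical token budgets with the offline `LLM` API. The model id, workload, and multipliers are assumptions, and in practice each configuration is better run in a fresh process so GPU memory is fully released between runs:

```python
import time
from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048                      # assumed use-case limit
prompts = ["Summarize: ..."] * 64         # placeholder workload
params = SamplingParams(max_tokens=256)

for multiplier in (1, 2, 3, 4):
    budget = MAX_MODEL_LEN * multiplier
    # NOTE: constructing several engines in one process may not release GPU
    # memory cleanly; prefer one process per configuration for real sweeps.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
        max_model_len=MAX_MODEL_LEN,
        max_num_batched_tokens=budget,
    )
    start = time.perf_counter()
    llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    print(f"max_num_batched_tokens={budget}: {len(prompts) / elapsed:.2f} req/s")
    del llm
```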
## enable-prefix-caching:

TL;DR Enable it, unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, which saves considerable time.
* The gain depends on how much of the prompt is shared: if most of each request is identical across requests (for example, a long system prompt or a fixed document with only a small user-specific part changing), you will see a much larger speedup than when only a small portion of the tokens is shared.
* If enable-chunked-prefill is used, this parameter is not supported together with it (as of the time of this post's publication).
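A workload where every request shares a long, identical prefix is roughly where prefix caching pays off most. A minimal sketch, with a made-up system prompt and a placeholder model id:

```python
from vllm import LLM, SamplingParams

# Hypothetical workload: every request repeats the same long system prompt,
# which is the case prefix caching helps most.
SYSTEM_PROMPT = "You are a support assistant for ACME Corp. Policies: ..." * 20
questions = ["How do I reset my password?", "What is the refund policy?"]

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    enable_prefix_caching=True,
)

prompts = [f"{SYSTEM_PROMPT}\n\nUser: {q}\nAssistant:" for q in questions]
for out in llm.generate(prompts, SamplingParams(max_tokens=128)):
    # After the first request, the KV values of SYSTEM_PROMPT are served from
    # the prefix cache instead of being recomputed.
    print(out.outputs[0].text)
```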
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run into memory shortages. The default is 0.9

* The more GPU memory is available for the KV cache, the better the performance, because more sequences and longer contexts can stay resident at once.
* Reduce this value if you encounter memory shortage errors, or if the GPU is shared with another process that needs part of its memory.
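A small sketch of capping the engine when part of the GPU is reserved for something else; the 0.75 figure and the model id are assumptions, not recommendations:

```python
from vllm import LLM

# Hypothetical setup where roughly a quarter of the GPU is reserved for another
# process (e.g., an embedding model), so vLLM is capped below the 0.9 default.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    gpu_memory_utilization=0.75,  # lower the default only when you have to
)
```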
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is reserved for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU is shared with another workload that needs part of the memory (see the sketch below).
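A short sketch of the shared-GPU case; the 0.6 value and the model name are illustrative, not recommendations.

```python
# Sketch: cap vLLM's share of GPU memory when another process uses the same
# GPU; otherwise leave the default (0.9) or push it higher if it stays stable.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
    gpu_memory_utilization=0.6,  # leave ~40% of VRAM for the other workload
)
```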
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured. This may cost some performance, but it saves the memory the graphs would otherwise occupy.
* In some configurations the CUDA graphs make little difference to performance, and the reclaimed memory can instead be spent on the parameters above, such as gpu-memory-utilization and max-num-batched-tokens, as in the sketch below.
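A minimal sketch of that trade-off, assuming the engine arguments keep their usual names (enforce_eager, gpu_memory_utilization); the values and model name are illustrative.

```python
# Sketch: trade CUDA-graph capture for memory. Skipping graph capture
# (enforce_eager=True) frees the memory the graphs would occupy, which can be
# redirected to a larger KV cache.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder
    enforce_eager=True,           # skip CUDA graph capture, save its memory
    gpu_memory_utilization=0.95,  # spend the reclaimed headroom on KV cache
)
```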
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-util{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-b{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-t{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
## max-model-len:

TL;DR Set according to the maximum number of tokens used (input + output)

* By default, the maximum model length equals the maximum context length of your model. For example, for the Llama 3 8B Instruct model, it will be 8192.
* If you can determine the maximum context length for your use case and it is less than your model's maximum context length, it is better to set this parameter to that value.
* This not only prevents memory shortage errors during model loading or usage, but also helps you configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.
* This value covers both input and output tokens, so if your use case never needs more than, say, 2046 tokens in total, set this parameter accordingly; a minimal launch sketch follows below.
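To make this concrete, here is a minimal launch sketch using the offline `LLM` entry point, capping the context at 2048 tokens for a use case whose prompts plus completions stay under that length. The model name and token budget are illustrative assumptions, not recommendations.

```python
from vllm import LLM, SamplingParams

# Cap the context window at 2048 tokens (input + output combined)
# instead of the model's full 8192-token context length.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=2048,
)

params = SamplingParams(max_tokens=256)
result = llm.generate(["Summarize what a KV cache is."], params)
print(result[0].outputs[0].text)
```

If you deploy through the OpenAI-compatible server instead, the same limit is passed as the `--max-model-len` flag.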
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or an eviction warning. Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See the explanations below.

* Note that this parameter counts tokens, not requests: it is not a batch size.
* Each request must go through a prefill stage (computing KV values for the input tokens and distributing them into KV cache blocks), followed by a decoding stage, in which repeated forward passes generate output tokens one by one.
* Thus, the minimum value of this parameter equals the max-model-len you set, since an entire prefill has to fit into one batch (unless you use chunked-prefill, discussed later).
* The more tokens a batch can hold, the more prefill work can be scheduled at once, so the first token is produced sooner; the vLLM scheduler favors prefill by default.
* Therefore, a larger token batch can improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so it quickly exhausts whatever token budget the batch allows.
* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you hit a memory shortage error or an eviction warning, then pick the batch token count that performs best. Applying this procedure will also teach you how your system behaves; a rough sweep harness is sketched below.
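To turn the tuning advice above into something repeatable, here is a rough sweep harness. It assumes you restart the process for each candidate value (so every run starts from a clean GPU) and uses a synthetic prompt set; the script name, model, prompt, and token counts are all placeholder assumptions.

```python
import sys
import time

from vllm import LLM, SamplingParams

# Run once per candidate, e.g.:
#   python sweep.py 2048 && python sweep.py 4096 && python sweep.py 6144
candidate = int(sys.argv[1])

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=2048,
    # Without chunked prefill this must be at least max_model_len.
    max_num_batched_tokens=candidate,
)

prompts = ["Explain paged attention in two sentences."] * 64  # synthetic load
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"max-num-batched-tokens={candidate}: {generated / elapsed:.1f} output tok/s")
```

While the sweep runs, watch the engine logs for the memory shortage errors and eviction warnings mentioned above; the best setting is usually the largest value that stays free of them.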
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values for reuse, saving considerable time.
* If most of each request is identical from one input to the next (for example, a long shared instruction or few-shot prefix), you will see a much larger performance boost than when only a small portion of the request's tokens is shared; see the sketch after this list.
* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post's publication).
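A small sketch of the case where prefix caching pays off: many requests share a long, unchanging prefix and differ only in a short suffix. The prefix text and questions below are made up for illustration.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    enable_prefix_caching=True,
)

# A long section that is identical for every request (instructions, policy,
# few-shot examples, ...). Its KV values are computed once and then reused.
shared_prefix = "You are a support assistant. Follow this policy:\n" + "- ...\n" * 50

questions = ["How do I reset my password?", "What is the refund window?"]
params = SamplingParams(max_tokens=64)

for q in questions:
    out = llm.generate([shared_prefix + q], params)[0]
    print(out.outputs[0].text.strip())
```

Only the first request pays the full prefill cost of the shared prefix; subsequent requests reuse its cached KV blocks, which is exactly the situation where the speed-up is largest.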
## gpu-memory-utilization:

TL;DR The higher, the better, unless memory shortage occurs. Default is 0.9

* The more GPU memory is allocated for storing the KV cache, the better the performance.
* Reduce this value if you run into memory shortage errors, or if the GPU is also running another workload that needs part of its memory; an example follows.
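As an illustration, suppose the same GPU also hosts another process and you want to leave it some headroom; the 0.75 figure below is an arbitrary example, not a recommendation.

```python
from vllm import LLM

# Let vLLM claim at most ~75% of the device memory (default is 0.9),
# leaving the rest for whatever else is running on the GPU.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    gpu_memory_utilization=0.75,
)
```

The corresponding server flag is `--gpu-memory-utilization`; raise it back toward the default once the GPU is dedicated to vLLM again.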
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graph will be created, which may affect performance but will save the memory that would otherwise be required for the graph.
* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can instead be used to increase the parameters above, such as gpu-memory-utilization or max-num-batched-tokens; a short example follows.
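A short sketch of the trade-off described above, reinvesting the memory saved by skipping CUDA graphs into a larger KV cache; the exact numbers are assumptions you would verify with your own throughput measurements.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    enforce_eager=True,            # skip CUDA graph capture to save memory
    gpu_memory_utilization=0.95,   # example: spend the savings on KV cache
)
```

It is worth benchmarking tokens per second with and without enforce-eager (restarting the engine between runs), since on some configurations the CUDA graph makes little measurable difference.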
## enable-chunked-prefill:

TL;DR Try
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from {“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 1{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory vLLM is allowed to reserve, the more space remains for the KV cache after the weights are loaded, and the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the same GPU also runs another process that needs its own share of memory.
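For example, a sketch of leaving part of the card to another process (the 0.8 is an arbitrary illustration, not a recommendation):

```python
from vllm import LLM

# Reserve at most 80% of GPU memory for vLLM, leaving the rest for a co-located process.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.8,  # fraction of total GPU memory vLLM may use (default 0.9)
)
```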
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are built, which may reduce performance but saves the memory the graphs would otherwise occupy.
* Under some configurations CUDA graphs make little difference, and the freed memory can instead be spent on the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
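A sketch of that trade-off, with the specific numbers again being illustrative assumptions rather than tuned values:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,           # skip CUDA graph capture to free memory
    gpu_memory_utilization=0.95,  # illustrative: reinvest the savings in a larger KV cache
)
```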
## enable-chunked-prefill:

TL;DR Try enabling this feature with a batched-token value starting from 128, increasing it gradually. Compare with the configuration without this parameter. See details below.
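A minimal launch sketch for this experiment, under the same assumptions as the earlier snippets (placeholder model and prompt; 128 is only the suggested starting point):

```python
from vllm import LLM, SamplingParams

# Chunked prefill splits long prompts into chunks so decode steps of other requests
# can be interleaved with prefill work; start with a small token budget and sweep upward.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # rerun with 256, 512, ... and compare against the non-chunked setup
)

print(llm.generate(["Write a haiku about paged attention."],
                   SamplingParams(max_tokens=32))[0].outputs[0].text)
```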
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values for reuse, which can save considerable time.
* If most of each request stays the same across inputs (for example, a long shared system prompt or few-shot template), the performance boost is much larger than when only a small portion of the tokens is shared.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication).
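The effect is easiest to see with a long shared prefix. The sketch below uses a hypothetical support-bot prompt and a placeholder model: with enable_prefix_caching=True, the KV blocks computed for the shared prefix on the first request can be reused by the second, so only the differing suffix has to be prefilled.

```python
from vllm import LLM, SamplingParams

# Hypothetical long prefix shared by every request (system prompt + policy).
SHARED_PREFIX = (
    "You are a support assistant for ExampleCorp. "
    "Follow the policy below when answering.\n" + "policy line\n" * 200
)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,  # cache the KV blocks of the shared prefix
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# The first request pays the full prefill cost; the second reuses the
# cached KV blocks for SHARED_PREFIX and only prefills its final line.
for question in ("How do I reset my password?", "How do I close my account?"):
    out = llm.generate([SHARED_PREFIX + question], params)[0]
    print(out.outputs[0].text.strip()[:80])
```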
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is set aside for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU also has to serve another workload that needs part of its memory.
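To get a feel for what this memory buys, it helps to estimate how many tokens of KV cache fit in a given budget. The arithmetic below assumes the published Llama 3 8B configuration (32 layers, 8 KV heads of dimension 128) and an fp16 cache; substitute your own model's numbers.

```python
# Rough KV-cache capacity estimate for Llama 3 8B (assumed config).
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

# Both K and V are stored, hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"{bytes_per_token / 1024:.0f} KiB of KV cache per token")  # ~128 KiB

for cache_gib in (10, 20, 40):
    tokens = cache_gib * 1024**3 // bytes_per_token
    print(f"{cache_gib} GiB of KV cache holds roughly {tokens:,} tokens")
```

Raising gpu-memory-utilization enlarges exactly this pool (whatever VRAM remains after weights and activations), which directly increases how many requests can stay resident at once.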
## enforce-eager:

TL;DR Enable this if memory is tight

* When enabled, no CUDA graphs are built, which may hurt performance but saves the memory the graphs would otherwise occupy.
* Under some configurations CUDA graphs barely affect performance, and the saved memory can be spent on raising the parameters above, such as gpu-memory-utilization and max-num-batched-tokens.
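A minimal sketch of trading CUDA graphs for memory headroom, using the same placeholder model as above; whether the latency cost is noticeable depends on your batch sizes, so benchmark both settings.

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture: potentially slower decode in
# some setups, but the memory the graphs would occupy stays available and
# can be reinvested, e.g. in a higher gpu_memory_utilization.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,
    gpu_memory_utilization=0.95,  # example of spending the saved memory
)
```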
## enable-chunked-prefill:

TL;DR Try enabling it with max-num-batched-tokens starting at 128 and increase gradually; compare against the configuration without this flag. See details and the sketch below.

* The decoding stage is less demanding on computational resources than the prefill stage.
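As a starting point for the comparison suggested in the TL;DR, the sketch below enables chunked prefill with a deliberately small token budget. Both the 128-token budget and the model name are placeholders; measure throughput and time to first token against your non-chunked baseline before committing to it.

```python
from vllm import LLM

# With chunked prefill, long prompts are prefilled in chunks and scheduled
# alongside decode steps, so max_num_batched_tokens can be far smaller than
# max_model_len; start small and increase while benchmarking.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,
)
```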
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches already-computed KV values for reuse, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt or few-shot examples), the speedup is much larger than when only a small portion of the tokens is shared.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication).
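A minimal sketch of the shared-prefix case, assuming a long system prompt reused across requests (the prompt and model are illustrative):

```python
from vllm import LLM, SamplingParams

# Server equivalent: --enable-prefix-caching
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)

# A long shared prefix lets later requests reuse the KV blocks
# computed for the first one instead of recomputing them.
shared_prefix = "You are a support assistant for ACME Corp. " * 50
questions = ["How do I reset my password?", "Where is my invoice?"]

params = SamplingParams(temperature=0.0, max_tokens=64)
for out in llm.generate([shared_prefix + q for q in questions], params):
    print(out.outputs[0].text)
```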
## gpu-memory-utilization:

TL;DR The higher, the better, unless a memory shortage occurs. The default is 0.9

* The more GPU memory is available for storing the KV cache, the better the performance.
* Reduce this value if you run into memory shortage errors, or if the GPU also hosts another workload that needs part of the memory.
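For the GPU-sharing case in the last bullet, a sketch might look like this; the 0.6 value is only an assumption for a GPU that also runs another process:

```python
from vllm import LLM

# Server equivalent: --gpu-memory-utilization 0.6
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.6,  # leave ~40% of VRAM for another workload
)
```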
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured; this may cost some performance but saves the memory the graphs would otherwise occupy.
* In some configurations the CUDA graphs barely affect performance, and the saved memory can instead be spent on increasing the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
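A hypothetical memory-constrained configuration combining the two ideas above (skip CUDA graph capture and hand the reclaimed memory to the KV cache); the exact numbers are placeholders to experiment with:

```python
from vllm import LLM

# Server equivalent: --enforce-eager --gpu-memory-utilization 0.95
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,           # no CUDA graph capture; saves memory
    gpu_memory_utilization=0.95,  # spend the reclaimed memory on KV cache
    max_model_len=2048,
)
```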
## enable-chunked-prefill:

TL;DR Try enabling this feature with a small max-num-batched-tokens value, starting from 128 and gradually increasing, and compare against the configuration without it. See details below.

* The decoding stage is less demanding on computational resources than prefill (vector-matrix multiplication instead of matrix-matrix multiplication).
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to pre{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill ({“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-m{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## max-model-len:

TL;DR Set it to the maximum number of tokens you actually use (input + output).

* By default, the maximum model length equals the maximum context length of your model; for the Llama 3 8B Instruct model, for example, that is 8192.
* If you know the maximum context length of your use case and it is smaller than the model's maximum context length, set this parameter to that smaller value.
* Doing so not only prevents out-of-memory errors during model loading or inference, it also makes it easier to choose the other parameters, such as max-num-batched-tokens and gpu-memory-utilization.
* The value covers both input and output tokens, so if your use case never exceeds, say, 2048 tokens in total, set this parameter accordingly.
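As a concrete illustration, here is a minimal offline-inference sketch (the model name and prompt are placeholders, not taken from this post) that caps the context at 2048 tokens instead of relying on the model's native 8192. The same setting is exposed as --max-model-len when launching the vLLM server.

```python
# Minimal sketch: size the engine for the real workload rather than the model's
# full native context. Model name and prompt are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,  # input + output tokens per request
)

params = SamplingParams(max_tokens=256)
out = llm.generate(["Summarize what a KV cache is."], params)
print(out[0].outputs[0].text)
```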
## max-num-batched-tokens:

TL;DR Start with max-model-len, then keep increasing it: max-model-len * 2, then * 3, and so on, until you hit an out-of-memory error or an eviction warning, and keep the value that gives the best performance. If you use enable-chunked-prefill, start with a much smaller value, such as 128, and experiment from there. See the explanation below.

* Note that this parameter counts tokens, not requests; it is not a batch size in the usual sense.
* Every request goes through a prefill stage (computing the KV values of the input tokens and placing them into KV cache blocks), followed by a decoding stage, in which an autoregressive forward pass generates tokens one at a time.
* The minimum value of this parameter therefore equals the max-model-len you set, because a full prefill must fit into a single scheduling step (unless you use chunked prefill, discussed later).
* The larger the token budget, the more prefill work can be scheduled at once, so the first token arrives sooner; the vLLM scheduler favors prefill by default.
* A larger token budget can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so the GPU saturates well before very large token budgets pay off.
* Hence the recommendation: start with max-model-len, keep increasing it, and watch how performance changes until you hit an out-of-memory error or an eviction warning, then pick the token budget that performs best. A sweep like the one sketched below also helps you understand how your own workload behaves.
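Here is one way such a sweep could look in the offline API; it is only a sketch, and the model name, prompts, and timing loop are assumptions rather than anything from this post. Run the script once per candidate value (a fresh process per configuration avoids leftover GPU allocations) and compare the printed throughput.

```python
# Rough sketch: benchmark a single max_num_batched_tokens value, then rerun with
# 2x, 3x, ... of max-model-len and compare. Stop when vLLM reports out-of-memory
# errors or KV-cache eviction warnings. Model name and prompts are placeholders.
import sys
import time

from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048
budget = int(sys.argv[1]) if len(sys.argv) > 1 else MAX_MODEL_LEN

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=MAX_MODEL_LEN,
    max_num_batched_tokens=budget,  # token budget per scheduler step
)

prompts = ["Write a haiku about GPUs."] * 64
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"max_num_batched_tokens={budget}: {generated / elapsed:.1f} generated tokens/s")
```

The same comparison can be run against a live server with any load-testing tool; the important part is to change only one parameter per run.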
## enable-prefix-caching:

TL;DR Enable it, unless you are using enable-chunked-prefill.

* It caches the computed KV values of shared prefixes so they can be reused across requests, which saves a considerable amount of time.
* If most of each request is identical across inputs (for example, a long fixed system prompt or a set of few-shot examples), the speedup is much larger than when only a small portion of the tokens is shared.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time this post was published).
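A minimal sketch of the pattern that benefits most, with a made-up shared system prompt: the KV values of the common prefix are computed once and reused for every later request that starts with the same tokens.

```python
# Sketch: many requests sharing one long prefix. With prefix caching enabled, the
# prefix's KV values are computed once; later requests only pay for their unique
# suffix. The prompt text is made up for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,
    enable_prefix_caching=True,
)

SHARED_PREFIX = (
    "You are a support assistant for ACME Corp. Answer politely, quote the manual "
    "where relevant, and keep answers under three sentences.\n\n"
)
questions = ["How do I reset my password?", "Where can I download my invoices?"]

params = SamplingParams(max_tokens=64)
for out in llm.generate([SHARED_PREFIX + q for q in questions], params):
    print(out.outputs[0].text.strip())
```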
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is made available for the KV cache, the better the performance, because more sequences can be kept resident at the same time.
* Lower this value if you hit out-of-memory errors, or if the GPU also has to run another workload that needs part of the memory.
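As a rough back-of-envelope illustration (the numbers below are assumptions, not from this post), the fraction translates into KV-cache space roughly as follows; vLLM also reserves some memory for activations, so the figure it reports at startup will be somewhat lower.

```python
# Back-of-envelope sketch with assumed numbers: how gpu-memory-utilization
# translates into space left for the KV cache.
total_gib = 80.0     # e.g. a single 80 GiB GPU
weights_gib = 16.0   # ~16 GiB for an 8B-parameter model in bf16
utilization = 0.9    # the gpu-memory-utilization setting (default)

kv_cache_gib = utilization * total_gib - weights_gib
print(f"Roughly {kv_cache_gib:.0f} GiB left for the KV cache")  # ~56 GiB
```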
## enforce-eager:

TL;DR Enable it if you are short on memory.

* When enabled, no CUDA graphs are captured. This can cost some performance, but it saves the memory the graphs would otherwise occupy.
* In some configurations CUDA graphs barely affect performance, and the memory saved can be put to better use by raising the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
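A sketch of that trade-off (the 0.95 is an illustrative value): eager mode skips CUDA-graph capture, and the memory it frees is handed back to the KV cache by raising gpu_memory_utilization. It is worth benchmarking both settings, since the cost of eager mode depends on the model and batch sizes.

```python
# Sketch: trade CUDA-graph capture for extra KV-cache headroom. The 0.95 is an
# illustrative value; benchmark with and without enforce_eager to see whether
# graphs matter for your workload.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,
    enforce_eager=True,           # skip CUDA graph capture, save the graph memory
    gpu_memory_utilization=0.95,  # reinvest the saved memory into the KV cache
)
```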
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values for reuse, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt with only a small user-specific part), the performance boost is much larger than when only a small portion of the tokens is shared.
* If enable-chunked-prefill is used, this parameter is invalid (as of the time of this post's publication).
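A small sketch of the case where prefix caching helps most: many requests sharing one long prefix. The system prompt, model name, and question set are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

# A long prefix shared by every request (assumed to be a few thousand tokens).
SYSTEM_PROMPT = "You are a support assistant for ACME.\n" + ("Policy clause...\n" * 200)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,   # reuse KV blocks of the shared prefix
)

questions = ["How do I reset my password?", "What is the refund window?"]
prompts = [f"{SYSTEM_PROMPT}\nUser: {q}\nAssistant:" for q in questions]

# After the first request fills the cache, later requests skip most of the
# prefill work for the shared prefix.
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```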
## gpu-memory-utilization:

TL;DR The higher, the better, unless memory shortages occur. The default is 0.9.

* The more GPU memory is allocated for storing the KV cache, the better the performance.
* Reduce this value if you run into memory shortage errors, or if the GPU also runs another task that needs part of its memory.
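A minimal sketch; the 0.95 setting is an assumption to illustrate raising the fraction when the GPU is dedicated to vLLM, and it should be lowered again if loading fails or other jobs share the card.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.95,  # default is 0.9; more room for the KV cache
)
```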
## enforce-eager:

TL;DR Enable this if memory is insufficient.

* When enabled, no CUDA graph is built; this may cost some performance but saves the memory the graph would otherwise require.
* In some configurations the CUDA graph makes little difference to performance, and the saved memory can instead be spent on larger values for the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
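A short sketch of that trade-off; the specific values are assumptions showing the saved graph memory being reinvested in the KV cache.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,            # skip CUDA graph capture to save memory
    gpu_memory_utilization=0.95,   # reinvest the headroom in the KV cache
)
```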
## enable-chunked-prefill:

TL;DR Try enabling this feature with a token batch starting at 128 and increasing gradually, and compare against the configuration without it. See details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more demanding on memory, as it has to read the previously computed KV values.
* For further reading, refer to [this article
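A sketch of the starting point suggested in the TL;DR; whether these exact keyword arguments were available together at the time of the post is an assumption, so treat it as illustrative rather than definitive.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,   # per-step token budget; long prefills are split to fit it
)

# Benchmark this against a run without chunked prefill, then try 256, 512, ...
outputs = llm.generate(["Describe chunked prefill."], SamplingParams(max_tokens=128))
```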
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
## max-num-batched-tokens:

TL;DR Start with max-model-len, then keep increasing it (max-model-len * 2, then * 3, ...) until you hit an out-of-memory error or an eviction warning, and keep the value that performs best. If you use enable-chunked-prefill, start with a much smaller value, such as 128, and experiment from there. See the explanation below and the benchmarking sketch that follows this list.

* Note that this parameter counts tokens, not requests, so it is not the batch size in the usual sense.
* Every request first goes through a prefill stage (computing the KV values for the input tokens and placing them into KV cache blocks), followed by a decoding stage in which the forward pass is repeated to generate output tokens one by one.
* Consequently, the minimum value of this parameter is the max-model-len you set, because a full prefill must fit into one scheduled batch (unless you use chunked prefill, discussed later).
* The larger the token budget, the more prefill work can be scheduled at once, so the first token of each request arrives sooner; the vLLM scheduler favors prefill by default.
* A larger token budget can therefore improve throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so the GPU often saturates well before the token budget does.
* Hence, start with max-model-len, keep doubling, and observe how performance changes until you run out of memory or see eviction warnings; pick the value that performs best on your workload and use the experiment to build intuition about how your system behaves.
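The sweep itself is easy to script. The sketch below is a rough probe rather than a rigorous benchmark: it times `generate` for a single candidate value, so run it once per setting (preferably in separate processes, since each run loads the model) and compare generated tokens per second. The model, prompts, and candidate value are placeholders.

```python
# Rough throughput probe for one max_num_batched_tokens candidate (placeholder values).
# Run once per candidate value and compare generated tokens per second.
import time

from vllm import LLM, SamplingParams

CANDIDATE = 4096  # e.g. max_model_len * 2; also try * 1, * 3, ...

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=2048,
    max_num_batched_tokens=CANDIDATE,
)

prompts = ["Explain paged attention briefly."] * 64      # stand-in workload
params = SamplingParams(max_tokens=128, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"max_num_batched_tokens={CANDIDATE}: {generated / elapsed:.1f} generated tok/s")
```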
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches already-computed KV values so they can be reused across requests, which saves considerable time.
* The speedup depends on how much of the prompt is shared: if most of the request is identical across inputs (for example, a long fixed system prompt with only a short user-specific suffix), the gain is much larger than when only a small prefix of the tokens is shared. A sketch of this pattern follows the list.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication).
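The pattern that benefits most looks roughly like the sketch below: every request starts with the same long prefix, so its KV values are computed once and reused. The model name and the padded system prompt are illustrative only.

```python
# Sketch: long shared system prompt + short request-specific suffixes.
# With prefix caching on, the shared prefix's KV values are reused across requests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    enable_prefix_caching=True,
)

system_prompt = "You are a support assistant for ACME Corp. " * 50  # long shared prefix
questions = ["How do I reset my password?", "Where is my invoice?", "Please cancel my plan."]
prompts = [f"{system_prompt}\nUser: {q}\nAssistant:" for q in questions]

for out in llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0.0)):
    print(out.outputs[0].text.strip()[:80])
```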
## gpu-memory-utilization:

TL;DR The higher, the better, as long as you do not run out of memory. The default is 0.9.

* The more GPU memory vLLM is allowed to use, the more of it is available for the KV cache, and the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the same GPU also has to serve another workload.
## enforce-eager:

TL;DR Enable this if memory is tight

* When enabled, no CUDA graphs are captured. This can cost some performance, but it saves the memory the graphs would otherwise occupy.
* Under some configurations the CUDA graphs make little difference, and the freed memory can instead be spent on the parameters above, such as gpu-memory-utilization or max-num-batched-tokens. A combined sketch of these two memory knobs follows below.
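Both memory knobs are plain engine arguments, so trading graph memory for KV-cache memory is a one-line change, as in the sketch below. The values are illustrative; on a GPU shared with other workloads you would lower gpu_memory_utilization rather than raise it.

```python
# Sketch: trade CUDA-graph memory for KV-cache memory (illustrative values only).
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    gpu_memory_utilization=0.95,  # raise only if nothing else runs on this GPU
    enforce_eager=True,           # skip CUDA graph capture: slower steps, less memory
)
```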
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/23{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/230{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
## enable-chunked-prefill:

TL;DR Try enabling this feature and use a batched-token budget starting from 128, gradually increasing; compare with the configuration without this parameter. See details and a sweep sketch below.

* The decoding stage is less demanding on compute than prefill (vector-matrix instead of matrix-matrix multiplication), but more demanding on memory, since it has to read back the previously computed KV values.
* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and the corresponding vLLM documentation page.
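Following the TL;DR, a simple way to pick the token budget is a small sweep with chunked prefill enabled. This is only a sketch under assumptions: the model name and prompts are placeholders, the kwargs mirror the flags discussed in this post, and a real comparison should use your own traffic and latency targets.

```python
import time

from vllm import LLM, SamplingParams

prompts = ["Summarize the following paragraph: ..."] * 32   # placeholder workload
params = SamplingParams(max_tokens=128)

# Start small, as suggested above, and grow the budget while measuring throughput.
for budget in (128, 256, 512, 1024):
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",   # placeholder model
        enable_chunked_prefill=True,
        max_num_batched_tokens=budget,
        max_model_len=2048,
    )
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"max_num_batched_tokens={budget}: {generated / elapsed:.1f} generated tok/s")
    # GPU memory is not always fully released between engines; in practice,
    # run each configuration in a fresh process.
    del llm
```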
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.1{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.163{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.1636{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this v{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLL{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, which saves considerable time.
* If most of the prompt stays identical from request to request (for example, a long shared system prompt or few-shot examples), the speedup is much larger than when only a small portion of the tokens is shared; the sketch after this list shows that pattern.
* If enable-chunked-prefill is set, this parameter has no effect (as of the time of this post’s publication).
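
The workload that benefits most looks like the sketch below: every request shares a long, identical prefix whose KV blocks are computed once and then reused. The model name, policy text, and questions are invented for illustration:

```python
# Sketch of a prefix-caching-friendly workload using the offline vllm.LLM API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,  # reuse KV cache blocks of identical prompt prefixes
    max_model_len=2048,
)

shared_prefix = (
    "You are a support assistant. Answer using only the policy below.\n"
    "Policy: refunds within 30 days; shipping within 5 business days.\n\n"
)
questions = ["How long do refunds take?", "When will my order ship?"]
prompts = [shared_prefix + f"Customer question: {q}\nAnswer:" for q in questions]

for out in llm.generate(prompts, SamplingParams(max_tokens=64, temperature=0.0)):
    print(out.outputs[0].text.strip())
```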
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9

* The more GPU memory vLLM is allowed to reserve, the more room there is for the KV cache, and the better the performance.
* Reduce this value if you encounter out-of-memory errors or if the GPU is shared with another workload that needs part of the memory.
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured, which may cost some performance but saves the memory the graphs would otherwise require.
* In some configurations the CUDA graph barely affects performance, and the memory it frees up can be spent on the parameters above, such as gpu-memory-utilization or max-num-batched-tokens; the sketch below combines the two memory-related knobs.
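
The two memory-related knobs often move together: give part of the GPU to the other workload and recover some of the loss by skipping CUDA graphs. The 0.8 fraction below is an illustrative assumption, not a recommendation:

```python
# Sketch of a memory-constrained configuration using the offline vllm.LLM API.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,
    gpu_memory_utilization=0.8,  # fraction of GPU memory vLLM may reserve (default 0.9)
    enforce_eager=True,          # skip CUDA graph capture; may cost speed, saves the graphs' memory
)
```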
## enable-chunked-prefill:

TL;DR Try enabling this feature with max-num-batched-tokens starting at 128 and gradually increase it. Compare against the configuration without this parameter. See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplications) but is harder on memory, since it has to read the previously computed KV values at every step. Chunked prefill splits a long prefill into chunks and mixes them with decode steps, which helps balance compute and memory use; a starting configuration is sketched after this list.
* For further reading, refer to this article and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).
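
A starting point for the experiment suggested in the TL;DR might look like the sketch below; 128 is only the initial value to increase from, and the result should be compared against the non-chunked configuration on your own workload:

```python
# Sketch of enabling chunked prefill with a small per-step token budget,
# using the offline vllm.LLM API and an illustrative model name.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # long prefills are split into chunks of this size and interleaved with decodes
)
```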
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](https://{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](https://docs{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](https://docs.v{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](https://docs.vll{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](https://docs.vllm{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](https://docs.vllm.ai{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](vLLM{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](vLLM{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](https://docs.vllm.ai/en/latest/models{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and [this vLLM page](https://docs.vllm.ai/en/latest/models/per{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, saving considerable prefill time.
* If a large, identical prefix (for example a long system prompt) is shared across your requests, the speedup is much bigger than when only a small portion of the tokens repeats.
* If the enable-chunked-prefill parameter is used, this parameter is ignored (as of the time of this post's publication).
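A sketch of the kind of workload that benefits, again with the offline `LLM` API (the server-side equivalent is the `--enable-prefix-caching` flag); the prompt text and model are placeholders:

```python
# Every request shares the same long instruction block, so its KV values are
# computed once and then reused from the prefix cache for subsequent requests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # example model
    enable_prefix_caching=True,
)

SHARED_PREFIX = "You are a support assistant. Follow these policies: ..."  # long, identical prefix
questions = ["How do I reset my password?", "Where is my invoice?", "Cancel my plan."]

outputs = llm.generate(
    [f"{SHARED_PREFIX}\n\nUser: {q}\nAssistant:" for q in questions],
    SamplingParams(max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text)
```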
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is reserved for the KV cache, the better the performance.
* Reduce this value if you hit out-of-memory errors, or if the GPU also has to serve another workload that needs the headroom.

## enforce-eager:

TL;DR Enable this if memory is tight

* When enabled, no CUDA graphs are built. This may cost some performance, but it saves the memory the graphs would otherwise occupy.
* Under some configurations CUDA graphs make little difference, and the reclaimed memory can instead be spent on the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
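For a GPU that is shared with another process, a memory-constrained configuration might look like the sketch below (the values are illustrative starting points; the equivalent server flags are `--gpu-memory-utilization` and `--enforce-eager`):

```python
# Memory-constrained configuration sketch: cap the KV-cache share and skip
# CUDA graph capture to leave headroom for a co-located workload.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # example model
    gpu_memory_utilization=0.7,   # default is 0.9; lower it only when memory is contended
    enforce_eager=True,           # trade some speed for the memory CUDA graphs would use
    max_model_len=2048,           # a tight length budget also reduces memory pressure
)
```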
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the v{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLL{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graph will be created, which may affect performance but saves the memory that would otherwise be required for the graphs.
* Under certain configurations, the CUDA graphs may not significantly impact performance, and the saved memory can instead be used to increase the aforementioned parameters, such as gpu-memory-utilization and max-num-batched-tokens, as in the sketch below.
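If memory rather than latency is the bottleneck, the same sketch can trade the CUDA graphs for a larger KV-cache budget. Whether the higher values actually fit is hardware-dependent, so treat them as assumptions to benchmark against the default configuration.

```python
# Sketch assuming eager mode frees enough memory to raise the KV-cache
# budget; the exact values are illustrative and should be benchmarked
# against the CUDA-graph configuration.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,              # skip CUDA graph capture
    gpu_memory_utilization=0.95,     # spend the freed memory on KV cache
    max_model_len=2048,
    max_num_batched_tokens=4096,     # ...or on a larger token budget
)
```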
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors pre{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is made available for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the same GPU also runs another workload that needs part of the memory.
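For example, when the GPU is shared with another process, the fraction can be lowered explicitly; 0.6 below is just an illustrative value:

```python
# Sketch: leave headroom for another workload running on the same GPU.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.6,  # fraction of GPU memory vLLM may claim for weights + KV cache
)
```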
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured, which may cost some performance but saves the memory the graphs would otherwise occupy.
* Under some configurations the CUDA graphs barely affect performance, and the saved memory can be spent on raising the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
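One possible combination, sketched below, is to trade the CUDA-graph speedup for memory and reinvest the freed memory in a larger KV cache; the 0.95 value is an assumption for illustration:

```python
# Sketch: skip CUDA graph capture and spend the saved memory on the KV cache.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,           # do not capture CUDA graphs
    gpu_memory_utilization=0.95,  # use the freed memory for a larger KV cache
)
```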
## enable-chunked-prefill:

TL;DR Try enabling it, set max-num-batched-tokens to a small value such as 128, and increase it gradually; compare against the configuration without this parameter. See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but is more demanding on memory, because it has to read the previously computed KV values.
* For further reading, refer to this article and this vLLM page.
* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, which saves considerable time.
* If most of each request is identical from one input to the next (for example, a long shared system prompt), the performance boost is much larger than when only a small portion of the request's tokens stays the same.
* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post's publication).
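A small sketch of the case where prefix caching pays off most: many requests that share one long, unchanged prefix and differ only in a short suffix. The prompt text and model id are placeholders:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,  # reuse KV cache blocks of repeated prefixes
)

# Long shared prefix: its KV values are computed once and then reused.
SHARED_PREFIX = (
    "You are a support assistant. Answer strictly according to this policy:\n"
    + "- policy rule\n" * 200  # stand-in for a long, fixed system prompt
)

questions = ["How do I reset my password?", "Where can I download my invoice?"]
prompts = [SHARED_PREFIX + q for q in questions]

params = SamplingParams(max_tokens=128, temperature=0.0)
for req in llm.generate(prompts, params):  # later requests hit the cached prefix
    print(req.outputs[0].text)
```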
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run into memory shortages. The default is 0.9.

* This controls the fraction of GPU memory vLLM is allowed to use; the more memory is left over for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU is also running another task that needs part of the memory.
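For example, if another process has to share the same GPU, you might cap vLLM below the 0.9 default (the 0.7 below is purely illustrative, not a recommendation):

```python
from vllm import LLM

# Leave roughly 30% of GPU memory free for another process on the same card.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.7,  # default is 0.9; raise it if vLLM has the GPU to itself
    max_model_len=2046,
)
```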
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are created. This may cost some performance, but it saves the memory the graphs would otherwise occupy.
* Under some configurations the CUDA graphs make little difference to performance, and the saved memory can be spent on raising the parameters discussed above, such as gpu-memory-utilization and max-num-batched-tokens.
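A sketch of that trade: run in eager mode and spend the freed memory on a larger KV cache. Whether this is a net win for your workload is exactly what has to be benchmarked; the 0.95 below is an assumption, not a recommendation:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,           # skip CUDA graph capture and free its memory
    gpu_memory_utilization=0.95,  # ...and give the headroom to the KV cache
    max_model_len=2046,
)
```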
## enable-chunked-prefill:

TL;DR Try enabling this feature, start with a token batch of 128, and gradually increase it. Compare against the configuration without this parameter. See details below.

* The decoding stage demands less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but is more demanding on memory, as it has to read the previously computed KV values.
* For further reading, refer to this article and this vLLM page.
* As previously mentioned, the vLLM scheduler favors prefill by default, and prefill is what consumes the compute; during decoding, memory becomes the bottleneck instead, which leaves compute underutilized during the compute-light, memory-hungry decode steps, as KV values must be read from the cache for every generated token.
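With chunked prefill enabled, long prompts are split into chunks that can share a batch with decode steps, so max_num_batched_tokens may be set well below max_model_len. A sketch of the suggested starting point (128 is only the first value of the sweep; the model id is illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2046,
    enable_chunked_prefill=True,  # prefill long prompts in chunks, mixed with decodes
    max_num_batched_tokens=128,   # start small; raise to 256, 512, ... and re-measure
)

params = SamplingParams(max_tokens=256, temperature=0.0)
out = llm.generate(["A long document to summarize ..."], params)
print(out[0].outputs[0].text)
```

From here the earlier procedure applies unchanged: raise max_num_batched_tokens step by step and compare throughput and latency against the non-chunked configuration.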
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## max-num-batched-tokens:

TL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.

* Note that this parameter denotes the number of tokens, not the batch size.
* Each request must go through a prefill stage (computing KV values for the input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.
* Thus, the minimum value of this parameter equals the maximum model length you set, so that a full prefill fits in one batch (unless you use chunked-prefill, discussed later).
* The larger the token batch, the more prefill work can be scheduled per step, meaning the first token of each request is generated sooner. The vLLM scheduler by default favors prefill.
* Therefore, a larger token batch can enhance throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is so compute-heavy that the practical limit on the token batch size is reached quickly.
* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or an eviction warning (as specified here), then pick the token batch count that performs best. A small sweep like the one sketched below also shows how your system behaves.
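To make the tuning procedure concrete, here is a minimal sketch of such a sweep using vLLM's offline API and a toy timing loop. The candidate values, prompts, and timing approach are illustrative assumptions; a real benchmark would replay your production traffic with a proper load generator.

```python
import time
from vllm import LLM, SamplingParams

MAX_MODEL_LEN = 2048                                      # assumed use-case limit
candidates = [MAX_MODEL_LEN * k for k in (1, 2, 3, 4)]    # the doubling schedule from the TL;DR
prompts = ["Explain paged attention briefly."] * 64       # stand-in workload
params = SamplingParams(max_tokens=128)

for max_batched in candidates:
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        max_model_len=MAX_MODEL_LEN,
        max_num_batched_tokens=max_batched,
    )
    start = time.perf_counter()
    llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    print(f"max_num_batched_tokens={max_batched}: {len(prompts) / elapsed:.2f} req/s")
    del llm  # in practice, run each configuration in a fresh process so GPU memory is fully released
```

Stop increasing the value once you see out-of-memory errors or eviction warnings, and keep the best-performing configuration.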
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values for reuse, saving considerable time.
* If most of your prompt stays the same across requests (for example, a long shared instruction prefix), you will see a much larger performance boost than when only a small portion of the tokens is shared.
* If the enable-chunked-prefill parameter is used, this parameter has no effect (as of the time of this post's publication).
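As an illustration, the flag can be switched on when constructing the engine. The shared system prompt below is an invented example of the kind of unchanged prefix that benefits most from caching.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,  # reuse KV blocks for identical prompt prefixes
)

system = "You are a support assistant for ACME Corp. Answer politely.\n\n"  # shared prefix
questions = ["How do I reset my password?", "Where can I find my invoice?"]
params = SamplingParams(max_tokens=64)

# The KV values for `system` are computed once and reused for the second request.
outputs = llm.generate([system + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text.strip())
```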
## gpu-memory-utilization:

TL;DR The higher, the better, unless memory shortage occurs. Default is 0.9

* The higher the GPU memory allocated for storing KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors, or if the same GPU is running another workload that needs part of the memory.
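For instance, when the GPU is shared with another process, the fraction can be lowered at engine construction time; 0.7 below is an arbitrary illustrative value, not a recommendation.

```python
from vllm import LLM

# Leave roughly 30% of GPU memory free for other workloads on the same device.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.7,  # default is 0.9
)
```

The OpenAI-compatible server exposes the same knob as --gpu-memory-utilization.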
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graph will be created, which may affect performance but will save the memory that would otherwise be required for the graph.
* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.
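The bullet above suggests trading CUDA-graph memory for larger engine settings; a hedged sketch of that trade-off might look like the following, with the specific numbers chosen purely for illustration.

```python
from vllm import LLM

# Skip CUDA graph capture and spend the freed memory on a larger KV cache
# and a larger token batch. The exact values are illustrative, not a recipe.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,            # no CUDA graphs: possibly slower per step, but less memory
    gpu_memory_utilization=0.95,   # raised from the 0.9 default
    max_num_batched_tokens=8192,   # room for more prefill tokens per step
    max_model_len=2048,
)
```

Whether the eager-mode slowdown is outweighed by the larger KV cache depends on your model and workload, so measure both configurations.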
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk pre{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the pre{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values for reuse, which saves considerable time.
* The more of each request that is identical across inputs (for example, a long shared system prompt), the bigger the performance boost; if only a small portion of the tokens repeats, the gain is correspondingly small. The sketch after this list shows the intended usage pattern.
* If enable-chunked-prefill is set, this parameter has no effect (as of the time of this post's publication).
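As a rough illustration, here is how a long shared prefix would be reused with the offline API. The model name, policy text, and questions are made up, and `enable_prefix_caching` is assumed to be accepted as an engine argument by `LLM(...)` in your vLLM version.

```python
# Prefix caching pays off when many requests share a long, identical prefix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    enable_prefix_caching=True,
)

shared_prefix = (
    "You are a support assistant. Answer using only the policy below.\n"
    "POLICY: ...\n"  # imagine a few thousand tokens of identical context
)
questions = [
    "Can I return an opened item?",
    "How long does standard shipping take?",
]

params = SamplingParams(temperature=0.0, max_tokens=64)
# The KV values computed for shared_prefix on the first request can be
# reused by later requests, so only each question suffix is prefilled.
outputs = llm.generate([shared_prefix + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text.strip())
```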
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory vLLM is allowed to claim for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the same GPU also has to serve another workload that needs the headroom; a configuration sketch follows.
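A minimal configuration sketch, assuming the GPU is dedicated to vLLM; 0.95 is an example value rather than a recommendation, and the model name is a placeholder.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    gpu_memory_utilization=0.95,  # default is 0.9; back off if you hit OOM
    max_model_len=2048,
)
```

At startup the engine typically logs how many KV cache blocks it managed to allocate, which is a quick way to see the effect of changing this value.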
## enforce-eager:

TL;DR Enable this if memory is tight

* When enabled, no CUDA graphs are built. That can cost some performance, but it saves the memory the graphs would otherwise require.
* Under some configurations the CUDA graphs barely affect performance, and the freed memory can instead be spent on the parameters above, such as gpu-memory-utilization or max-num-batched-tokens; see the sketch below.
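A minimal sketch of that trade-off, with placeholder model and values; benchmark both settings, since whether the CUDA graphs matter depends on your model and batch sizes.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    enforce_eager=True,           # skip CUDA graph capture, save its memory
    gpu_memory_utilization=0.95,  # spend the freed memory on KV cache instead
)
```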
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run into memory shortages. The default is 0.9.

* The more GPU memory allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors or if the GPU also has to run another workload that needs the headroom.
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graph is built, which may hurt performance but saves the memory the graph would otherwise require.
* Under some configurations the CUDA graph makes little difference, and the saved memory can instead be used to raise the parameters discussed above, such as gpu-memory-utilization or max-num-batched-tokens (see the sketch below).
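A minimal sketch of a memory-constrained configuration combining the last two knobs follows; the values are illustrative assumptions, not recommendations, and the keyword arguments follow the vLLM Python API naming.

```python
# Trade CUDA-graph memory for a larger KV cache on a tight GPU.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    gpu_memory_utilization=0.95,  # raise from the 0.9 default if the GPU runs nothing else
    enforce_eager=True,           # skip CUDA graph capture; frees memory for the KV cache
    max_model_len=2048,           # keep the context budget tight for the same reason
)
```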
## enable-chunked-prefill:

TL;DR Try enabling this feature with max-num-batched-tokens starting from 128 and increasing gradually. Compare against the configuration without this parameter. See details below.

* The decoding stage needs far less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but is more demanding on memory, since the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As previously mentioned, the vLLM scheduler by default favors prefill, which is compute-bound, while during decoding memory becomes the bottleneck; as a result, decoding, with its low compute needs and high memory traffic, cannot be fully utilized.
* Chunked prefill splits the prefill stage into chunks, allowing a much smaller token batch and letting the scheduler form a batch by collecting decoding requests and adding prefill chunks to them (a minimal configuration is sketched below).
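The sketch below shows the corresponding configuration under the same assumptions as the earlier examples (placeholder model name, vLLM Python API naming); the 128-token budget is the starting point suggested in the TL;DR, to be tuned upward while comparing against the non-chunked setup.

```python
# Chunked prefill lets max_num_batched_tokens be much smaller than max_model_len,
# because prefills are processed in chunks that share the batch with decodes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # per-step token budget shared by prefill chunks and decodes
)

outputs = llm.generate(
    ["Explain chunked prefill in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text.strip())
```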
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values for reuse, which can save considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), you will see a much larger performance boost than when only a small portion of the tokens is shared (see the sketch after this list).
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post’s publication).
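As a small illustration of where prefix caching pays off, the sketch below sends several requests that share one long prefix; with `enable_prefix_caching=True` the KV values for that prefix are computed once and reused. The system prompt and model name are illustrative placeholders.

```python
# Sketch: many requests sharing one long prefix benefit most from prefix caching.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,
    enable_prefix_caching=True,  # cache KV blocks of repeated prefixes
)

system_prompt = "You are a support assistant for ACME. Policy: ...\n"  # long shared prefix
questions = ["How do I reset my password?", "What is the refund window?"]

params = SamplingParams(temperature=0.0, max_tokens=128)
# The KV values for `system_prompt` are computed on the first request and
# reused afterwards, so only the question part needs a fresh prefill.
outputs = llm.generate([system_prompt + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```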
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9

* The more GPU memory is made available for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU is also running another task that needs the headroom.
## enforce-eager:

TL;DR Enable this if memory is tight

* When enabled, no CUDA graphs are built, which may cost some performance but saves the memory the graphs would otherwise occupy.
* Under some configurations CUDA graphs make little difference to performance, and the saved memory can be spent on increasing the parameters above, such as gpu-memory-utilization and max-num-batched-tokens (see the sketch after this list).
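A minimal sketch of that trade, under the assumption that the GPU is dedicated to vLLM (the 0.95 value and model name are placeholders to adapt to your card):

```python
# Sketch: give almost all GPU memory to the KV cache and skip CUDA graphs.
# 0.95 assumes nothing else shares the GPU; lower it if other processes need memory.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,
    gpu_memory_utilization=0.95,  # more room for KV cache than the 0.9 default
    enforce_eager=True,           # no CUDA graphs: possibly slower, but saves memory
)
```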
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked pre{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values for reuse, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), the speedup is much larger than when only a small portion of the prompt repeats; a sketch of this pattern follows.
* If the enable-chunked-prefill parameter is used, this parameter is not supported (as of the time of this post's publication).
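The pattern that benefits most from prefix caching is a long, fixed prefix followed by a short variable part. A rough sketch under that assumption, with a hypothetical system prompt and an illustrative model name:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=True,
)

# The identical prefix is computed once; its KV blocks are reused for
# every request that starts with the same tokens.
SYSTEM_PROMPT = "You are a support assistant for ExampleCorp. Answer briefly.\n\n"
questions = ["How do I reset my password?", "Where can I download my invoice?"]

outputs = llm.generate(
    [SYSTEM_PROMPT + q for q in questions],
    SamplingParams(max_tokens=128),
)
```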
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is set aside for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU also has to serve another task that needs part of the memory.
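As a sketch (the 0.85 figure is illustrative, not a recommendation): lower the fraction only when something else has to share the GPU.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.85,  # default is 0.9; keep or raise it if vLLM has the GPU to itself
)
```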
## enforce-eager:

TL;DR Enable this if memory is tight

* When enabled, no CUDA graph is created, which may cost some performance but saves the memory the graph would otherwise require.
* Under some configurations the CUDA graph makes little difference to performance, and the saved memory can be put toward larger values of the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
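A sketch of that trade-off: skip CUDA graph capture and reinvest the freed memory in a larger KV cache. The values are illustrative; benchmark both configurations on your own workload.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enforce_eager=True,            # do not build CUDA graphs; saves their memory overhead
    gpu_memory_utilization=0.95,   # spend the saved headroom on KV cache instead
)
```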
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining comput{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunk{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked pre{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is set aside for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the same GPU also runs another workload that needs part of the memory (a combined example appears after the enforce-eager section below).
## enforce-eager:

TL;DR Enable this if memory is tight

* When enabled, no CUDA graphs are captured. This may cost some performance, but it saves the memory the graphs would otherwise occupy.
* In some configurations CUDA graphs barely affect performance, and the saved memory can instead be spent on raising the parameters discussed above, such as gpu-memory-utilization or max-num-batched-tokens.
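The following sketch combines the last two parameters into one memory-constrained configuration; the numbers are illustrative, not recommendations. It assumes the GPU is shared with another process, so vLLM gets a reduced share of memory and skips CUDA graph capture.

```python
# Illustrative memory-constrained configuration: give vLLM only part of the
# GPU because another workload shares the device, and skip CUDA graphs to
# reclaim the memory they would occupy.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    max_model_len=2048,           # bound the context to the real use case
    gpu_memory_utilization=0.7,   # leave headroom for the co-located workload
    enforce_eager=True,           # no CUDA graphs: slower, but saves memory
)
```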
## enable-chunked-prefill:

TL;DR Try enabling this feature, start with a token budget of around 128 and increase it gradually, and compare against the configuration without this parameter. See the details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory bandwidth, because the previously computed KV values have to be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned earlier, the vLLM scheduler favors prefill by default. Prefill is compute-bound, while decoding is memory-bound (the KV values must be read), so a prefill-first schedule leaves the compute-light, memory-heavy decode steps underutilized.
* Chunked prefill splits the prefill stage into chunks. This permits a much smaller token budget, and each batch can then be formed by collecting decode requests and topping it up with prefill chunks, so the GPU is used efficiently: compute-intensive prefill chunks run alongside decode requests that are light on compute but heavy on memory reads.
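Finally, a minimal sketch of the starting point suggested in the TL;DR (placeholder model; tune the budget against your own traffic and compare with the non-chunked configuration):

```python
# Illustrative chunked-prefill configuration: prefill is split into chunks,
# so a small per-step token budget (128 here, the suggested starting point)
# can mix prefill chunks with in-flight decode requests in the same batch.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # start small, then increase and re-measure
)
```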
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is set aside for the KV cache, the better the performance.
* Reduce this value if you hit out-of-memory errors, or if the GPU is shared with another workload that needs part of the memory.
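A minimal sketch of setting this through the offline API; the 0.95 value and the model name are only examples, to be adjusted to your own GPU and workload.

```python
# Minimal sketch: 0.9 is the default; raise it if the GPU is dedicated to vLLM,
# lower it if the engine hits out-of-memory errors or shares the GPU with other jobs.
from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          max_model_len=2048,
          gpu_memory_utilization=0.95)   # fraction of GPU memory vLLM may claim
```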
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured. This can cost some performance, but it saves the memory the graphs would otherwise occupy.
* In some configurations CUDA graphs barely affect performance, and the freed memory is better spent on the parameters above, such as gpu-memory-utilization or max-num-batched-tokens (see the sketch below).
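For illustration, a hedged sketch of trading CUDA-graph memory for a larger KV cache and token batch; the exact values are placeholders and should come from your own experiments.

```python
# Minimal sketch: skip CUDA graph capture to reclaim its memory and spend it
# on a bigger KV cache / token batch instead (the trade-off described above).
from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          max_model_len=2048,
          enforce_eager=True,              # no CUDA graphs, save their memory
          gpu_memory_utilization=0.95,     # ...and give the savings to the KV cache
          max_num_batched_tokens=4096)     # ...or to a larger token batch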
## enable-chunked-prefill:

TL;DR Try enabling it with max-num-batched-tokens starting at 128 and increasing gradually; compare against the configuration without this parameter. See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication), but it is heavier on memory, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler favors prefill by default, and prefill is compute-bound. During decoding, memory becomes the bottleneck: decode steps need little compute but must read a lot of KV cache, so on their own they cannot keep the GPU fully utilized.
* Chunked prefill splits the prefill stage into chunks. That allows a much smaller token batch and lets the scheduler build each batch by collecting decode requests and topping it up with prefill chunks, so compute-heavy prefill work runs alongside decode requests that are light on compute but heavy on memory, keeping the GPU well utilized.
* Try enabling this feature with a small maximum batch token count, such as 128, and increase it gradually while comparing performance against the configuration without chunked prefill; a minimal sketch follows below.
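A minimal sketch of the recommended starting point, again via the offline API and with placeholder model and prompts; the same configuration maps to the --enable-chunked-prefill and --max-num-batched-tokens server flags.

```python
# Minimal sketch: with chunked prefill, max_num_batched_tokens may drop well
# below max_model_len; start around 128 and increase while benchmarking.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          max_model_len=2048,
          enable_chunked_prefill=True,
          max_num_batched_tokens=128)   # per-step budget shared by prefill chunks and decodes

outputs = llm.generate(["Explain chunked prefill in one paragraph."] * 8,
                       SamplingParams(max_tokens=200))
print(len(outputs), "requests completed")
```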
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as {“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 1{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 12{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is set aside for storing the KV cache, the better the performance.
* Reduce this value if you hit out-of-memory errors or if the GPU is shared with another workload that needs its own memory.
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured. This may cost some performance, but it saves the memory the graphs would otherwise require.
* Under some configurations the CUDA graphs barely affect performance, and the memory saved can be spent on raising the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
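As an illustration of how these two memory knobs interact, here is a minimal sketch with the offline `LLM` API; the model name and the 0.95 value are placeholders to tune for your own GPU.

```python
# Minimal sketch (placeholder model and values): trade CUDA-graph memory for
# extra KV-cache headroom.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder checkpoint
    gpu_memory_utilization=0.95,  # raise above the 0.9 default only if the GPU is dedicated to vLLM
    enforce_eager=True,           # skip CUDA graph capture: a bit slower per step, but frees memory
)
```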
## enable-chunked-prefill:

TL;DR Try enabling it, start max-num-batched-tokens at 128, and increase gradually. Compare against the configuration without this parameter. See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory bandwidth, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler favors prefill by default. Prefill is compute-bound, while decoding is memory-bound, so a prefill-first schedule prevents the engine from fully exploiting decode steps that need little compute but a lot of memory traffic.
* Chunked prefill splits the prefill stage into chunks. The token budget per batch can then be kept small, and each batch is assembled from waiting decode requests plus a chunk of prefill, so compute-heavy prefill work is overlapped with memory-heavy, compute-light decoding and the GPU is used more efficiently.
* Try enabling this feature with a small maximum token batch, such as 128, and gradually increase it until you achieve the desired throughput. A sweep harness is sketched below.
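This is not vLLM's own benchmark tooling, just a rough sketch of the sweep described above: run it once per candidate setting and compare generated-token throughput on prompts that resemble your real traffic. The model name, prompt, and token counts are placeholders.

```python
# bench_batched_tokens.py -- rough sweep harness (placeholder model and prompts).
# Run once per candidate setting, e.g.:
#   python bench_batched_tokens.py 128 --chunked   # chunked prefill: start small
#   python bench_batched_tokens.py 4096            # no chunked prefill: start at max-model-len and double
import argparse
import time

from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("max_num_batched_tokens", type=int)
parser.add_argument("--chunked", action="store_true")
args = parser.parse_args()

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder checkpoint
    max_model_len=2048,
    max_num_batched_tokens=args.max_num_batched_tokens,
    enable_chunked_prefill=args.chunked,
)

prompts = ["<a representative prompt from your workload>"] * 64
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Throughput in generated (decode) tokens per second for this configuration.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s "
      f"(max_num_batched_tokens={args.max_num_batched_tokens}, chunked={args.chunked})")
```

Running each setting in a fresh process keeps leftover GPU memory from a previous engine from skewing the comparison; without chunked prefill, keep the budget at or above max-model-len.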
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, which saves considerable prefill time.
* The gain is largest when most of each request is identical from one request to the next (for example a long, fixed system prompt) and only a small portion of the tokens changes; if most of the request changes every time, the boost is much smaller.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication). A small example follows below.
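A minimal sketch of the case where prefix caching pays off, assuming a long system prompt that is identical across requests (model name, prompt, and values are illustrative):

```python
from vllm import LLM, SamplingParams

# The long, repeated instructions are the shared prefix that prefix caching can reuse.
SYSTEM_PROMPT = "You are a support assistant. Answer strictly according to the policy below.\n" * 20

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=4096,
    enable_prefix_caching=True,
)

questions = ["How do I reset my password?", "How do I close my account?"]
prompts = [SYSTEM_PROMPT + q for q in questions]

# Later requests reuse the cached KV blocks of SYSTEM_PROMPT instead of recomputing them.
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```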
## gpu-memory-utilization:

TL;DR The higher, the better, as long as you do not run out of memory. The default is 0.9.

* The more GPU memory is reserved for the KV cache, the better the performance, because more requests and longer contexts can be kept in flight.
* Lower this value if you hit out-of-memory errors, or if the same GPU also has to run another workload that needs part of the memory.
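For example (the model name and the 0.95 value are illustrative; keep the default 0.9, or go lower, if the GPU is shared):

```python
from vllm import LLM

# Dedicated inference GPU: raise utilization so more KV-cache blocks fit.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=2048,
    gpu_memory_utilization=0.95,  # default is 0.9; reduce it if you hit OOM or share the GPU
)
```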
## enforce-eager:

TL;DR Enable it if memory is insufficient

* When enabled, no CUDA graphs are built. That can cost some performance, but it saves the memory the graphs would otherwise occupy.
* In some configurations CUDA graphs make little difference to performance, and the freed memory can be spent on raising the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
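A sketch of the memory-constrained configuration this suggests (model name and values are illustrative):

```python
from vllm import LLM

# Skip CUDA graph capture and spend the savings on a larger KV cache.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=2048,
    enforce_eager=True,           # no CUDA graphs
    gpu_memory_utilization=0.95,  # reinvest the freed memory in KV-cache blocks
)
```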
## enable-chunked-prefill:

TL;DR Try enabling it, start max-num-batched-tokens at 128, and increase gradually; compare against the configuration without this parameter. See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory bandwidth, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler favors prefill by default. Prefill is compute-bound, while decoding is memory-bound, so decode steps, with their low compute and high memory demands, cannot fully utilize the GPU on their own.
* Chunked prefill splits the prefill stage into chunks. This allows a much smaller token budget per batch, so a batch can be filled with decode requests and topped up with prefill chunks, pairing compute-heavy prefill work with memory-heavy, compute-light decode work and keeping the GPU busy on both fronts.
* Enable the feature, start with a small maximum token budget such as 128, and increase it gradually until you reach the throughput and latency you need; a configuration sketch follows below.
* This method may reduce token latency, because decode requests no longer have to wait behind full-length prefills.
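A starting configuration for the experiment described above, assuming the offline Python API (argument names may differ between vLLM versions; the model and workload are placeholders):

```python
from vllm import LLM, SamplingParams

# Chunked prefill: small per-step token budget, decode requests fill the rest of each batch.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    max_model_len=4096,
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # starting point from the TL;DR; raise it step by step
)

prompts = ["Write a short product description for a mechanical keyboard."] * 32
outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
```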
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless memory shortages occur. The default is 0.9.

* The more GPU memory is allocated for the KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors, or if the GPU is shared with another task that needs the headroom.

## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graph is created. This may cost some performance, but it saves the memory the graph would otherwise require.
* Under some configurations the CUDA graph has little impact on performance, and the freed memory can instead be used to raise the parameters discussed above, such as gpu-memory-utilization or max-num-batched-tokens. A memory-focused sketch follows below.
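A small sketch of the two memory-related knobs just discussed: give the KV cache more of the GPU with gpu_memory_utilization, and fall back to enforce_eager when the memory reserved for CUDA graphs is needed elsewhere. The values are illustrative, not recommendations.

```python
from vllm import LLM

# Give the KV cache as much of the GPU as the rest of the system allows;
# back off (e.g. to 0.80-0.85) if other processes share the device or you
# hit out-of-memory errors during loading or generation.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.95,  # default is 0.9
    enforce_eager=True,           # skip CUDA graph capture to save memory
)
```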
## enable-chunked-prefill:

TL;DR Try enabling this feature with a batched-token budget starting at 128 and gradually increase it. Compare against the configuration without this parameter. See details below.

* The decoding stage is less compute-intensive than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory-intensive, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler by default favors prefill, which is compute-bound, while decoding is memory-bound: decode steps need little compute but a lot of memory bandwidth to read KV values, so on their own they cannot fully utilize the GPU.
* Chunked prefill splits the prefill stage into chunks. This allows a much smaller batched-token budget and lets the scheduler build each batch from decode requests plus a prefill chunk, using the GPU efficiently by pairing the compute-heavy prefill chunk with enough decode requests, which are compute-light but memory-heavy.
* Enable the feature, start with a small maximum batched-token count such as 128, and gradually increase it until you reach the throughput and latency you need; a benchmarking sketch follows below.
* This method can also reduce inter-token latency, because decoding takes priority due to the small batch size.
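Finally, a sketch of the experiment described above under one assumed configuration: chunked prefill enabled with a per-step token budget of 128. The prompts, the model, and the throughput measurement are placeholders; rerun the script with larger budgets (256, 512, ...), ideally one configuration per process, and compare throughput and latency against your non-chunked baseline.

```python
import time
from vllm import LLM, SamplingParams

# With chunked prefill enabled, max_num_batched_tokens becomes the per-step
# token budget shared by decode requests and prefill chunks.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,
)

prompts = ["Explain paged attention in two sentences."] * 32
start = time.perf_counter()
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s with a 128-token budget")
```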
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless memory shortage occurs. Default is 0.9

* The more GPU memory allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors, or if the same GPU must also serve another workload that needs its own share of memory.
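A minimal sketch of how this is typically set, assuming the `gpu_memory_utilization` engine argument (the `--gpu-memory-utilization` CLI flag); the fractions and model name are illustrative, not recommendations.

```python
# Hypothetical configuration: give vLLM most of the GPU, or back off when sharing it.
from vllm import LLM

# Dedicated GPU: raise the fraction until you start hitting out-of-memory errors.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.95,                  # default is 0.9
)

# Shared GPU (e.g. an embedding model also runs there): leave headroom instead.
# llm = LLM(model="...", gpu_memory_utilization=0.7)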
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured, which may cost some performance but saves the memory the graphs would otherwise require.
* Under certain configurations the CUDA graphs make little difference to performance, and the saved memory can instead be spent on the parameters above, such as gpu-memory-utilization or max-num-batched-tokens; see the sketch below.
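A sketch of the trade-off described above: skip CUDA graph capture and reinvest the reclaimed memory in a larger KV cache. The values are placeholders; measure both variants on your own workload before committing.

```python
# Hypothetical memory-constrained configuration: enforce_eager=True (--enforce-eager)
# disables CUDA graphs; the saved memory is spent on the KV cache instead.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enforce_eager=True,            # no CUDA graphs; possibly slower per step, less memory
    gpu_memory_utilization=0.95,   # use the reclaimed memory for KV cache
)
```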
## enable-chunked-prefill:

TL;DR Try enabling this feature with a batched-token budget starting at 128 and gradually increasing, and compare against the configuration without it. See details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more demanding on memory, since the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler by default favors prefill, which is compute-bound. Decode-only steps are memory-bound and compute-light, so running them separately leaves the GPU's compute underutilized while KV values are being read.
* Chunked prefill splits the prefill stage into chunks, which allows a much smaller token budget per step and lets the scheduler build each batch from decoding requests plus a chunk of prefill. Combining compute-intensive prefill chunks with memory-intensive, compute-light decode requests uses the GPU more efficiently.
* Enable the feature with a small maximum batched-token count, such as 128, and gradually increase it until you reach the throughput and latency you need (a starting configuration is sketched after this list).
* This method may reduce token latency, as decoding takes priority and the small per-step batch reduces memory pressure, but it may affect throughput.
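A sketch of the suggested starting point, assuming `enable_chunked_prefill` and `max_num_batched_tokens` are accepted as engine arguments in your vLLM version; the model and prompt are placeholders, and the token budget is meant to be raised step by step while re-measuring.

```python
# Hypothetical chunked-prefill configuration: a small token budget per scheduler
# step lets decode requests ride along with prefill chunks in the same batch.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,   # --enable-chunked-prefill on the CLI
    max_num_batched_tokens=128,    # start low, then try 256, 512, ... and re-measure
)
outputs = llm.generate(
    ["Long document to summarize: ..."],          # placeholder workload
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```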
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, which can save considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), the performance boost is much larger than when only a small portion of the tokens is shared; see the sketch after this list.
* If enable-chunked-prefill is set, this parameter has no effect (as of the time of this post's publication).
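A minimal sketch of the pattern that benefits most, assuming a long shared instruction prefix; the prefix text and model name are placeholders.

```python
from vllm import LLM, SamplingParams

# Assumed shared prefix: with prefix caching enabled, its KV values are
# computed once and reused by every request that starts with it.
SHARED_PREFIX = (
    "You are a support assistant for ACME Corp. Follow the policy below...\n"
    "(imagine a few thousand tokens of policy text here)\n\n"
)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model
    enable_prefix_caching=True,
)
params = SamplingParams(temperature=0.0, max_tokens=64)

questions = ["How do I reset my password?", "What is the refund window?"]
outputs = llm.generate([SHARED_PREFIX + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text.strip())
```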
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory vLLM is allowed to claim for the KV cache, the better the performance, because more sequences and longer contexts fit in the cache at once.
* Reduce this value if you encounter out-of-memory errors, or if the same GPU runs another workload that needs part of the memory (see the short example below).
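For example, a hedged sketch of leaving room for a co-located workload; the 0.6 figure and the model name are assumptions, not recommendations.

```python
from vllm import LLM

# Assumed scenario: another process (say, an embedding model) needs roughly
# 40% of this GPU, so cap vLLM at 60% of device memory instead of the
# default 0.9. The remaining memory stays available to the other process.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model
    gpu_memory_utilization=0.6,
)
```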
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured. That may cost some performance, but it saves the memory the graphs would otherwise occupy.
* Under some configurations the CUDA graphs make little difference to performance, and the saved memory can instead be used to raise the parameters above, such as gpu-memory-utilization or max-num-batched-tokens. A small comparison harness follows this list.
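A quick way to see what eager mode costs and saves on your setup is to run the same workload with and without it. The sketch below is an assumption-laden harness (model, prompts, and the free-memory readout as a rough proxy for graph overhead), not a benchmark.

```python
import time
import torch
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed model
prompts = ["Write a haiku about GPUs."] * 32     # stand-in workload
params = SamplingParams(temperature=0.0, max_tokens=64)

for eager in (False, True):
    llm = LLM(model=MODEL, enforce_eager=eager)
    free, _total = torch.cuda.mem_get_info()  # free device memory after init
    start = time.perf_counter()
    llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    print(f"enforce_eager={eager}: {elapsed:.1f}s, {free / 2**30:.1f} GiB free after init")
    del llm  # in practice, run each configuration in a separate process
```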
## enable-chunked-prefill:

TL;DR Try enabling this feature with a batched-token budget starting at 128, increase it gradually, and compare against the configuration without this parameter. See details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix instead of matrix-matrix multiplication), but more demanding on memory, since it has to read all previously computed KV values.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler favors prefill by default, and prefill is compute-bound; during decoding, memory becomes the bottleneck, so a pure decode batch cannot fully utilize the GPU's compute.
* Chunked prefill splits the prefill stage into chunks. This makes a small per-batch token budget workable: the scheduler first collects decode requests and then tops the batch up with prefill chunks, pairing compute-intensive prefill work with memory-bound, compute-light decode work so the GPU is used efficiently.
* Enable this feature with a small maximum batched-token count, such as 128, and increase it gradually until you reach the throughput and latency you need.
* This approach can reduce inter-token latency, because decoding takes priority and the small batch keeps memory pressure low, but it may hurt throughput, so experiment with different token budgets and compare the results against a run without this parameter. A sketch of such a comparison follows this list.
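A minimal sketch of that comparison using the offline LLM API. The model, prompts, and the 128/256/512 budgets are placeholders, and throughput over total wall time is only a rough stand-in for proper latency and throughput measurements.

```python
import time
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"              # assumed model
prompts = ["Describe the city of Paris in detail."] * 64     # stand-in workload
params = SamplingParams(temperature=0.0, max_tokens=128)

configs = [
    {"enable_chunked_prefill": False},                                   # baseline
    {"enable_chunked_prefill": True, "max_num_batched_tokens": 128},
    {"enable_chunked_prefill": True, "max_num_batched_tokens": 256},
    {"enable_chunked_prefill": True, "max_num_batched_tokens": 512},
]

for cfg in configs:
    llm = LLM(model=MODEL, **cfg)
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{cfg}: {generated / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")
    del llm  # in practice, run each configuration in a separate process
```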
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory that can be reserved for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU also runs another workload that needs part of the memory.
## enforce-eager:

TL;DR Enable this if memory is tight

* When enabled, no CUDA graph is built. This may cost some performance, but it saves the memory the graph would otherwise require.
* Under some configurations the CUDA graph makes little difference, and the freed memory can instead be spent on the parameters above, such as gpu-memory-utilization or max-num-batched-tokens. A combined sketch of these two memory knobs follows.
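A short sketch combining the two memory-oriented knobs, assuming the offline Python API; the model name and the 0.7 value are placeholders for a GPU that is shared with another process.

```python
# Hypothetical configuration for a shared or memory-constrained GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.7,  # leave ~30% of VRAM for the other workload
    enforce_eager=True,          # skip CUDA graph capture to save memory
)
print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```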
## enable-chunked-prefill:

TL;DR Try enabling this feature with a token batch starting at 128 and increase it gradually. Compare against the configuration without this parameter. See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory bandwidth, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM documentation page.
* As mentioned above, the vLLM scheduler favors prefill by default. Prefill is compute-bound while decoding is memory-bound, so a prefill-first schedule cannot fully overlap the low-compute, high-memory decoding work with the compute-heavy prefill work.
* Chunked prefill splits the prefill stage into chunks. This allows a much smaller token batch: the scheduler first collects decoding requests and then fills the remaining budget with prefill chunks, so each batch combines compute-intensive prefill with memory-intensive decoding and keeps the GPU busy on both fronts.
* Try enabling this feature with a small maximum token batch, such as 128, and increase it gradually until you reach the throughput and latency you need.
* This approach usually lowers inter-token latency, because decoding gets priority and the small batch reduces memory pressure, but it can hurt throughput, so experiment with different batch sizes and compare against runs without this parameter. A sketch follows this list.
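A minimal sketch of the suggested starting point, assuming the offline Python API; the model name and the initial value of 128 are placeholders. Benchmark this configuration against the same workload without enable-chunked-prefill before committing to it.

```python
# Hypothetical chunked-prefill configuration: a small token budget so each
# batch mixes decode requests with prefill chunks.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # start small, then try 256, 512, ...
)

outputs = llm.generate(
    ["Explain chunked prefill in one short paragraph."] * 32,
    SamplingParams(max_tokens=128),
)
print(f"{len(outputs)} requests completed")
```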
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n##{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## enable-prefix-caching:

TL;DR Enable it, unless you are using enable-chunked-prefill.

* It caches computed KV values so they can be reused across requests, which saves considerable time.
* If most of your prompt is identical from request to request (for example, a long fixed system prompt or shared context), you will see a much larger performance boost than when only a small portion of the tokens is shared.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of writing).
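A minimal sketch of the pattern that benefits most, assuming a long prefix shared by every request (the system prompt below is just a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model id
    enable_prefix_caching=True,                   # reuse KV blocks for shared prefixes
)

SHARED_PREFIX = "You are a support assistant for ACME Corp. " * 50  # long, identical prefix
questions = ["How do I reset my password?", "Where can I download invoices?"]

params = SamplingParams(max_tokens=128)
# The second request reuses the cached KV blocks of SHARED_PREFIX instead of recomputing them.
outputs = llm.generate([SHARED_PREFIX + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```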
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU is also running another workload that needs the memory.
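For instance (a hedged sketch; the fractions are illustrative), on a dedicated GPU you might push the fraction up, and lower it when the card is shared:

```python
from vllm import LLM

# Dedicated GPU: give vLLM most of the memory; more KV cache blocks means more
# concurrent sequences before preemption or eviction kicks in.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model id
    gpu_memory_utilization=0.95,
)

# Shared GPU (another model or process on the same card): leave headroom instead.
# llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", gpu_memory_utilization=0.6)
```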
## enforce-eager:

TL;DR Enable this if memory is tight.

* When enabled, no CUDA graphs are built. This may cost some performance, but it saves the memory that the graphs would otherwise require.
* Under some configurations the CUDA graphs make little difference to performance, and the saved memory can instead be spent on increasing the parameters above, such as gpu-memory-utilization or max-num-batched-tokens.
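A minimal sketch (same illustrative model as above); the flag is a plain boolean, and the trade-off only shows up in your own benchmark numbers:

```python
from vllm import LLM

# Skip CUDA graph capture to free GPU memory; benchmark both settings,
# since the speed difference is workload-dependent.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model id
    enforce_eager=True,
    gpu_memory_utilization=0.95,  # the reclaimed memory can go to the KV cache instead
)
```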
## enable-chunked-prefill:

TL;DR Try enabling this feature with a small token budget, starting from 128 and gradually increasing it. Compare against the configuration without this parameter. See the details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix instead of matrix-matrix multiplication), but more demanding on memory, because it has to read the previously computed KV values.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler by default favors prefill, which is compute-bound, while during decoding memory becomes the bottleneck; this prevents the GPU from being fully utilized during decoding, which needs little compute but a lot of memory bandwidth because the KV values must be read.
* Chunked prefill splits the prefill stage into chunks. This allows a smaller token budget per batch, so a batch can be formed by collecting decoding requests and topping it up with chunked prefill requests. The GPU is then used efficiently by combining compute-intensive chunked prefill with enough decoding requests, which are light on compute but heavy on memory.
* Try enabling this feature with a small maximum number of batched tokens, such as 128, and gradually increase it until you reach the throughput and latency you need.
* This approach can reduce inter-token latency, because decoding gets priority thanks to the small batch (reducing memory pressure), but it can hurt throughput, so you need to experiment with different token budgets and compare against results obtained without this parameter.
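A sketch of the suggested starting point, assuming the same illustrative model; only the two flags below change relative to the default configuration:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model id
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # start small, then raise it while watching latency and throughput
)

params = SamplingParams(max_tokens=256)
# Long prompts are now prefilled in 128-token chunks interleaved with ongoing decodes.
outputs = llm.generate(["A long document to summarize ..."], params)
print(outputs[0].outputs[0].text)
```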
## Example:

* We will use vLLM's [API entrypoints](https://
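As a hedged sketch of what a request to the OpenAI-compatible entrypoint typically looks like (the host, port, and model name are assumptions; the server itself would be launched separately with the engine flags discussed above):

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally, started with
# the engine flags discussed above (--max-model-len, --gpu-memory-utilization, ...).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Give me one tip for tuning vLLM throughput."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```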
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entry{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints]({“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU is also running another task that needs part of the memory; a short example follows.
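As a small illustration, the fraction is a single engine argument; the values below are arbitrary examples, not recommendations for any particular GPU or model.

```python
from vllm import LLM

# Dedicated GPU: give as much memory as possible to the KV cache (until OOM).
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", gpu_memory_utilization=0.95)

# Shared GPU (e.g. another model on the same device): leave headroom instead.
# llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", gpu_memory_utilization=0.7)
```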
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured. This may cost some performance, but it saves the memory the graphs would otherwise occupy.
* Under certain configurations the CUDA graphs do not significantly affect performance, and the saved memory can be used to raise the parameters mentioned above, such as gpu-memory-utilization or max-num-batched-tokens; the sketch below shows that trade.
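A hedged sketch of that trade for a memory-constrained setup: skip CUDA-graph capture and spend the reclaimed memory on a larger KV cache and token budget. The model and numbers are illustrative only and should be re-benchmarked.

```python
from vllm import LLM

# Memory-constrained configuration: no CUDA graphs (enforce_eager=True), and the
# reclaimed memory is spent on the KV cache and a larger batched-token budget.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enforce_eager=True,            # slower per step, but no graph memory overhead
    gpu_memory_utilization=0.95,   # illustrative: raised thanks to the freed memory
    max_num_batched_tokens=4096,   # illustrative value to re-benchmark
)
```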
## enable-chunked-prefill:

TL;DR Try enabling this feature with a batched-token budget starting at 128 and increase it gradually. Compare with the configuration that does not set this parameter. See details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix instead of matrix-matrix multiplication), but more demanding on memory, because the previously computed KV values must be read.
* For further reading, refer to this article and this vLLM page.
* As previously mentioned, the vLLM scheduler by default favors prefill, which is compute-bound, while during decoding memory becomes the bottleneck; as a result, decode steps, which need little compute but a lot of memory bandwidth for reading KV values, cannot be fully utilized.
* Chunked prefill splits the prefill stage into chunks. This allows a small batched-token budget and lets the scheduler form a batch from decoding requests plus a chunked prefill request, using the GPU efficiently by combining compute-heavy prefill chunks with enough decode requests, which are light on compute but heavy on memory.
* Try enabling this feature with a small maximum batched-token count, such as 128, and increase it gradually until you reach the desired throughput and latency; a configuration sketch follows this list.
* This approach can reduce inter-token latency, because decoding takes priority and the small batch reduces memory pressure, but it may hurt throughput, so experiment with different budgets and compare against runs without this parameter.
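Below is a minimal sketch of the suggested starting point; the model is a placeholder and the 128-token budget is only the initial value to benchmark, not a tuned recommendation.

```python
from vllm import LLM, SamplingParams

# Starting point for chunked prefill: a small batched-token budget, grown step by
# step (256, 512, ...) and compared with a run that does not enable the feature.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,   # start small; increase and re-measure
)

outputs = llm.generate(
    ["Explain chunked prefill in one paragraph."],  # placeholder prompt
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```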
## Example:

* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob…)
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/v{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vll{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](vllm/vllm at main · vllm-project/vllm · GitHub{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](vllm/vllm at main · vllm-project/vllm · GitHub{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entry{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](vllm/vllm/entrypoints at main · vllm-project/vllm · GitHub{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](vllm/vllm/entrypoints/api_server.py at main · vllm-project/vllm · GitHub{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is set aside for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the same GPU must also run another task and therefore needs headroom.
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graphs are captured. This may hurt performance, but it saves the memory the graphs would otherwise occupy.
* In some configurations CUDA graphs make little difference to performance, and the saved memory can be spent on the parameters above, such as gpu-memory-utilization and max-num-batched-tokens.
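Putting these two memory knobs together, a hedged sketch (the 0.95 value is only an example, and whether trading CUDA graphs for KV-cache space helps depends on your model and hardware):

```python
# Sketch: trade CUDA-graph memory for KV-cache space.
# gpu_memory_utilization and enforce_eager follow vLLM's engine arguments.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # placeholder model
    enforce_eager=True,            # skip CUDA graph capture to free memory
    gpu_memory_utilization=0.95,   # ...and hand that memory to the KV cache
)
```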
## enable-chunked-prefill:

TL;DR Try enabling this feature with max-num-batched-tokens starting at 128 and gradually increasing. Compare with the configuration without this parameter. See details below.

* The decoding stage is less compute-intensive than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory-intensive, because the previously computed KV values must be read.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler by default favors prefill, which is compute-bound; during decoding, memory becomes the bottleneck, so decode steps, which need little compute but a lot of memory traffic to read KV values, cannot be fully utilized.
* Chunked prefill splits the prefill stage into chunks, which makes a smaller batch token count workable: a batch is formed by first collecting decode requests and then adding chunked prefill requests, so compute-heavy prefill chunks are combined with enough decode requests (low compute, high memory traffic) to use the GPU efficiently.
* Try enabling this feature with a small maximum batch token count, such as 128, and gradually increase it until you reach the throughput and latency you need (see the sketch after this list).
* This approach can reduce per-token latency, because decoding gets priority when the batch token budget is small (less memory pressure), but it may hurt throughput, so experiment with different batch token counts and compare against runs without this parameter.
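A minimal sketch of that starting point, assuming the engine-argument names used elsewhere in this post (increase max_num_batched_tokens between benchmark runs and compare):

```python
# Sketch: chunked prefill with a deliberately small token budget per batch,
# so decode requests are not starved by large prefills.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,    # starting point suggested above; raise between runs
)
```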
## Example:

* We will use vLLM's [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
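For illustration only, a hedged sketch of what that setup might look like: launch the demo API server with the flags discussed above, then send a request to its /generate endpoint. The launch flags and payload fields follow the simple api_server (not the OpenAI-compatible one) and may change between vLLM versions, so check the entrypoint source for your version.

```python
# Sketch: call vLLM's simple API server (vllm/entrypoints/api_server.py).
# Launch it first with the engine flags discussed in this post, e.g.:
#   python -m vllm.entrypoints.api_server \
#       --model meta-llama/Meta-Llama-3-8B-Instruct \
#       --max-model-len 2048 --max-num-batched-tokens 4096 \
#       --gpu-memory-utilization 0.9 --enable-prefix-caching
# Prompt, host, port, and sampling fields below are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Explain KV caching in one sentence.", "max_tokens": 64},
    timeout=60,
)
print(resp.json())
```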
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunk{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked pre{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
## Example:

* We will use vLLM's [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:
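A minimal sketch of such a command, assuming the Llama 3 8B Instruct model and illustrative values for the parameters discussed above (adjust them to your workload and GPU):

```
# Hypothetical baseline: no chunked prefill, prefix caching on.
python3 -m vllm.entrypoints.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-model-len 2048 \
    --max-num-batched-tokens 4096 \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching
```

Once the server is up, prompts go to its /generate endpoint, and you can compare throughput and latency against the chunked-prefill variant above.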
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. 
The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\n{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
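The chunked-prefill configuration to compare against could be sketched as follows, again with assumed values; note that prefix caching is omitted, since the two options did not combine at the time of writing:

```
# Chunked prefill: start with 128 batched tokens and increase while measuring.
python3 -m vllm.entrypoints.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-model-len 2048 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 128 \
    --gpu-memory-utilization 0.9
```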
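To sanity-check either server, a request can be sent to the demo entrypoint's generate route; the host, port, and request fields below assume the server defaults and standard sampling parameters:

```
# Assumes the default host/port and the demo server's /generate route.
curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0}'
```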
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model Maz{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model Maziy{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model Maziyar{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPan{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## max-model-len:

**TL;DR** Set it to the maximum number of tokens your use case actually needs (input + output).

* By default, the maximum model length equals the maximum context length of your model; for the Llama 3 8B Instruct model, for example, that is 8192.
* If you can determine the maximum context length your use case needs and it is smaller than the model's maximum context length, set this parameter to that value.
* Doing so not only prevents out-of-memory errors during model loading or inference, it also makes it easier to configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.
* The value covers both input and output tokens, so if your use case never exceeds, say, 2046 tokens in total, set the parameter accordingly (see the sketch below).
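As a minimal sketch of what this looks like with vLLM's offline Python API (the model id and the 2048-token cap are placeholders, and keyword-argument names may differ slightly between vLLM versions):

```python
from vllm import LLM, SamplingParams

# Placeholder model id; cap the context instead of inheriting the model's full
# 8192-token window, since this use case never needs more than ~2k tokens.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=2048,  # input + output tokens per request
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain what the KV cache stores."], params)
print(outputs[0].outputs[0].text)
```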
## max-num-batched-tokens:

**TL;DR** Start with max-model-len, then try increasing it: max-model-len * 2, then * 3, and so on, until you hit an out-of-memory error or an [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Pick the value that gives the best performance. If you use enable-chunked-prefill, start with a much smaller value, such as 128, and experiment from there. See the explanations below, and the measurement sketch after this list.

* Note that this parameter counts tokens, not requests; it is not the batch size.
* Each request first goes through a prefill stage (computing the KV values for the input tokens and placing them into KV-cache blocks), followed by a decoding stage, in which tokens are generated one at a time by recursive forward passes.
* The minimum value of this parameter is therefore the max-model-len you set, so that the prefill of a full-length request fits into one batch (unless you use chunked prefill, discussed later).
* The larger the token budget per batch, the more prefill work can be scheduled at once, so the first token is produced sooner; the vLLM scheduler favors prefill by default.
* A larger token budget can therefore increase throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so you quickly reach the largest token batch the GPU can actually sustain.
* Hence the recommendation: start with max-model-len, increase it step by step, and observe how performance changes until you hit an out-of-memory error or the eviction warning described [here](https://docs.vllm.ai/en/latest/models/performance.html), then keep the token count that performs best and use the same procedure to understand how your own workload behaves.
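A rough sketch of how one candidate value could be measured with the offline API (placeholder model id and prompts; in practice each setting is easiest to benchmark in a fresh process or server launch):

```python
import time
from vllm import LLM, SamplingParams

def tokens_per_second(max_batched_tokens: int, prompts: list[str]) -> float:
    """Rough throughput estimate for one candidate max-num-batched-tokens value."""
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
        max_model_len=2048,
        max_num_batched_tokens=max_batched_tokens,
    )
    params = SamplingParams(temperature=0.0, max_tokens=128)
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(out.outputs[0].token_ids) for out in outputs)
    return generated / elapsed

# Schedule from the TL;DR: 2048, 4096, 6144, ... run one value per process and
# stop once you hit OOM or the eviction warning from the vLLM docs.
```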
## enable-prefix-caching:

**TL;DR** Enable it, unless you are using enable-chunked-prefill.

* It caches computed KV values so they can be reused, which can save considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt or few-shot prefix), the speedup is much larger than when only a small portion of the tokens is shared; see the sketch below.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication).
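A minimal sketch of the shared-prefix case with the offline API (placeholder model id and prompts; the keyword argument is assumed to mirror the --enable-prefix-caching flag):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    max_model_len=2048,
    enable_prefix_caching=True,
)

# A long, fixed prefix shared by every request: its KV values are computed once
# and reused from the cache, so only the short per-request suffix needs prefill.
shared_prefix = (
    "You are a support assistant for ACME routers. "
    "Answer briefly and refer to the manual where relevant.\n\n"
)
questions = ["How do I reset the device?", "Why is the status LED blinking red?"]
params = SamplingParams(temperature=0.0, max_tokens=64)

for out in llm.generate([shared_prefix + q for q in questions], params):
    print(out.outputs[0].text.strip())
```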
## gpu-memory-utilization:

**TL;DR** The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory can be allocated to the KV cache, the better the performance.
* Reduce this value if you hit out-of-memory errors, or if the GPU is also running another workload that needs part of the memory.

## enforce-eager:

**TL;DR** Enable it if memory is tight.

* When enabled, no CUDA graph is built. This may cost some performance, but it saves the memory the graph would otherwise occupy.
* In some configurations the CUDA graph barely affects performance, and the saved memory can instead be used to raise the parameters above, such as gpu-memory-utilization or max-num-batched-tokens; a combined example follows below.
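For instance, a tight-memory configuration might trade the CUDA graph for a larger KV cache (a sketch with placeholder values; verify both settings against your own benchmarks):

```python
from vllm import LLM

# Hypothetical tight-memory setup: skip CUDA graph capture and hand most of the
# remaining GPU memory to the KV cache. Raise gpu_memory_utilization until you
# hit OOM, then back off slightly.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    max_model_len=2048,
    gpu_memory_utilization=0.95,
    enforce_eager=True,  # no CUDA graph: possibly slower, but saves memory
)
```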
## enable-chunked-prefill:

**TL;DR** Try enabling this feature with max-num-batched-tokens starting at 128, then increase it gradually. Compare against the configuration without this parameter. See the details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but is more demanding on memory, because it has to read the previously computed KV values.
* For further reading, see [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).
* As mentioned above, the vLLM scheduler favors prefill by default. Prefill is compute-bound, while during decoding memory becomes the bottleneck, so decoding, with its low compute demand and high memory demand from reading KV values, cannot fully utilize the GPU on its own.
* Chunked prefill splits the prefill stage into chunks, which makes a small token budget per batch feasible: a batch can be assembled from decoding requests plus chunked prefill requests, combining compute-heavy prefill chunks with plenty of compute-light, memory-heavy decoding requests, and thereby using the GPU more efficiently.
* Enable the feature, start with a small maximum token budget such as 128, and increase it gradually until you reach the throughput and latency you need (see the sketch after this list).
* This approach can reduce inter-token latency, because decoding gets priority thanks to the small batch (which also lowers memory pressure), but it can hurt throughput, so experiment with different values and compare against runs without this parameter.
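A starting-point sketch with the offline API (placeholder model id; the keyword arguments are assumed to mirror the --enable-chunked-prefill and --max-num-batched-tokens flags):

```python
from vllm import LLM

# Chunked-prefill starting point from the TL;DR: enable the feature and begin
# with a small per-step token budget, then raise it while watching throughput
# and inter-token latency.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    max_model_len=2048,
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # with chunked prefill this may be far below max-model-len
)
```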
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-L{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-In{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## Example:

* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.
```
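For reference, a fuller version of this launch command might look like the sketch below. The flag names (`--max-model-len`, `--max-num-batched-tokens`, `--gpu-memory-utilization`, `--enable-prefix-caching`) are standard vLLM engine arguments, but the concrete values are illustrative assumptions for a single-GPU setup, not the author's measured configuration:

```
# Baseline: no chunked prefill, prefix caching enabled.
# All numeric values below are assumptions -- tune them as described above.
python3 -m vllm.entrypoints.api_server \
    --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
    --max-model-len 2048 \
    --max-num-batched-tokens 2048 \
    --gpu-memory-utilization 0.95 \
    --enable-prefix-caching
```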
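Tying back to the chunked-prefill section, a comparison run with chunked prefill enabled might start from a small per-batch token budget. Again, the values here are assumptions to be tuned, not prescriptions:

```
# Chunked prefill: start with a small per-batch token budget and increase it
# (128 -> 256 -> 512 ...) while comparing throughput and latency with the baseline.
python3 -m vllm.entrypoints.api_server \
    --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.95 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 128
```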
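Once the server is up, the demo API entrypoint can be exercised with a plain HTTP request. A minimal sketch, assuming the default host and port and the `/generate` route this entrypoint exposed at the time of writing; the prompt and sampling values are arbitrary:

```
# Send a single generation request to the demo api_server.
# JSON fields other than "prompt" are passed through as sampling parameters.
curl -s http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0.0}'
```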
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-util{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization {"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
## gpu-memory-utilization:

**TL;DR** The higher, the better, unless you run out of memory. The default is 0.9

* The more GPU memory is reserved for the KV cache, the better the performance.
* Reduce this value if you hit out-of-memory errors, or if the GPU also runs another workload that needs part of its memory (the sketch after the enforce-eager section shows the flag in use).
## enforce-eager:

**TL;DR** Enable this if memory is tight

* When enabled, no CUDA graphs are built, which may cost some performance but saves the memory the graphs would otherwise occupy.
* In some configurations the CUDA graphs barely matter, and the memory saved can instead be spent on raising the parameters above, such as gpu-memory-utilization or max-num-batched-tokens (see the sketch below).
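As a sketch of that trade-off (not from the original post), the memory freed by skipping CUDA graphs can be handed back to the KV cache by raising gpu-memory-utilization in the same launch command:

```
# enforce-eager skips CUDA graph capture; the memory saved is used here to
# raise the fraction of GPU memory reserved for the KV cache.
python3 -m vllm.entrypoints.api_server \
    --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
    --enforce-eager \
    --gpu-memory-utilization 0.95 \
    --port 5000
```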
## enable-chunked-prefill:

**TL;DR** Try enabling this feature with max-num-batched-tokens starting at 128 and gradually increasing, and compare against the configuration without it. See details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix instead of matrix-matrix multiplication), but more demanding on memory, because it has to read the previously computed KV values.
* For further reading, see [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).
* As mentioned above, the vLLM scheduler favors prefill by default, and prefill is compute-bound; decoding, by contrast, is memory-bound, so its low compute demand cannot be fully exploited while it waits behind prefill.
* Chunked prefill splits the prefill stage into chunks, which lets you lower the token batch size and build batches that mix decode requests with chunked prefill requests, using the GPU efficiently by combining compute-intensive prefill chunks with memory-intensive, low-compute decode work.
* Enable the feature with a small maximum token batch, such as 128, and increase it gradually until you reach the throughput and latency you need; a launch sketch follows this list.
* This approach can lower inter-token latency, because decoding gets priority and the small batch reduces memory pressure, but it can hurt throughput, so experiment with different batch sizes and compare against runs without this parameter.
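Here is a minimal launch sketch for that experiment (not from the original post); 128 is only the suggested starting point and should be increased between benchmark runs.

```
# Chunked prefill with a deliberately small token batch; double the value and
# re-benchmark until throughput and latency are acceptable.
python3 -m vllm.entrypoints.api_server \
    --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
    --enable-chunked-prefill \
    --max-num-batched-tokens 128 \
    --port 5000
```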
## Example:

* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

```
python3 -m vllm.entrypoints.api_server \
    --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
    --gpu-memory-utilization 0.95 \
    --port 5000
```
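Once the server is up, you can send it a request. At the time of writing, the demo api_server exposed a simple /generate endpoint that accepts a prompt plus sampling parameters; the exact fields may differ between vLLM versions, so treat this as an illustrative sketch rather than a stable API.

```
# Hypothetical request to the demo API server started above; field names follow
# the api_server example shipped with vLLM and may change between releases.
curl -s http://localhost:5000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0.0}'
```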
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port {"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 50{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 500{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
## Example:

* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

```
python3 -m vllm.entrypoints.api_server \
    --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
    --gpu-memory-utilization 0.95 \
    --port 5000 \
    --dtype half \
    --enforce-eager
```
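Once the server is running, you can sanity-check it with a plain HTTP request. The api_server entrypoint exposes a simple /generate endpoint that takes a prompt plus sampling parameters in the request body; the prompt and sampling values below are placeholders for illustration:

```
# Sketch: querying the api_server started above.
curl -s http://localhost:5000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0.0}'
```

Timing the same set of requests under each configuration (with and without chunked prefill, and across different max-num-batched-tokens values) is the most reliable way to choose between them.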
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
## gpu-memory-utilization:

**TL;DR** The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU is shared with another workload that needs part of its memory.
## enforce-eager:

**TL;DR** Enable this if memory is insufficient

* When enabled, no CUDA graphs are built, which may cost some performance but saves the memory the graphs would otherwise occupy.
* In some configurations CUDA graphs make little difference to performance, and the saved memory can instead be spent on increasing the parameters mentioned above, such as gpu-memory-utilization or max-num-batched-tokens.
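As a rough sketch of that trade-off, the run below gives up CUDA graphs and hands almost all GPU memory to vLLM; the 0.95 matches the value used in the example at the end of the post, and whether this actually wins depends on your workload:

```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
  --port 5000 --dtype half --max-model-len 4096 --enforce-eager --gpu-memory-utilization 0.95
```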
## enable-chunked-prefill:

**TL;DR** Try enabling it, start max-num-batched-tokens at 128, and increase it gradually. Compare with the configuration without this parameter. See details below.

* The decoding stage demands less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory bandwidth, because the previously computed KV values must be read back for every generated token.
* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).
* As previously mentioned, the vLLM scheduler favors prefill by default, which is compute-bound; during decoding, memory becomes the bottleneck, so the GPU's compute cannot be fully used while KV values are being read.
* Chunked prefill splits the prefill stage into chunks. This allows a much smaller token budget per batch and lets the scheduler form batches that combine pending decode requests with a chunk of prefill, so compute-heavy prefill and memory-heavy, compute-light decoding share the GPU efficiently.
* Enable the feature, start with a small maximum batch token count such as 128, and increase it gradually until you reach the throughput and latency you need.
* This method usually lowers per-token latency, because decoding takes priority and the small batches reduce memory pressure, but it can hurt throughput, so experiment with different batch sizes and compare the results against runs without this parameter.
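Concretely, the suggested starting configuration might look like this; the 128 is just the initial value from the TL;DR, and the remaining flags mirror the example below:

```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
  --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager \
  --max-model-len 4096 --enable-chunked-prefill --max-num-batched-tokens 128
```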
## Example:

* We will use vLLM's [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example (the max-num-batched-tokens value simply follows the advice above to start at max-model-len):
```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 4096
```
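Once the server is running, you can send a request to its /generate endpoint; a minimal sketch (the prompt and sampling values here are arbitrary):

```
curl http://localhost:5000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0}'
```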
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-b{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-t{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens {"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## max-model-len:

**TL;DR** Set according to the maximum number of tokens used (input + output)

* By default, the maximum model length equals the maximum context length of your model. For example, for the Llama 3 8B Instruct model it is 8192.
* If you can determine the maximum context length for your use case and it is less than your model's maximum context length, it is better to set this parameter to that value.
* This not only prevents memory shortage errors during model loading or usage, but also helps you configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.
* This value covers both input and output tokens, so if your use case never needs more than, say, 2048 tokens in total, set this parameter accordingly.
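As a minimal illustration, here is a hedged sketch using vLLM's offline `LLM` Python API rather than the server. The model name is the one from the example command later in this post; the keyword arguments mirror the CLI flags but may differ slightly between vLLM versions.

```python
# Minimal sketch: cap the context length for a use case that never needs
# more than ~2K tokens (input + output). Model name is only illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    max_model_len=2048,   # instead of the model's default 8192
    dtype="half",
)

params = SamplingParams(max_tokens=256, temperature=0.0)
outputs = llm.generate(["Summarize the following text: ..."], params)
print(outputs[0].outputs[0].text)
```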
## max-num-batched-tokens:

**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you hit a memory shortage error or an [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that gives the best performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment from there. See explanations below.

* Note that this parameter counts tokens, not requests.
* Each request first goes through a prefill stage (computing KV values for the input tokens and placing them into KV cache blocks), followed by a decoding stage, in which a recursive forward pass generates tokens one at a time.
* Consequently, the minimum value of this parameter equals the max-model-len you set, so that a full prefill fits in one batch (unless you use chunked prefill, discussed later).
* The more tokens a batch can hold, the more prefill work fits into it, so the first token is produced sooner; the vLLM scheduler favors prefill by default.
* A larger token batch can therefore raise throughput, but not always: throughput and latency also depend on the balance between compute and GPU memory (see the chunked-prefill section), and prefill is compute-heavy, so the GPU typically saturates well before you reach the largest token batch you could hold in memory.
* Hence, start with max-model-len, try doubling it, and watch how performance changes until you hit a memory shortage error or the eviction warning described [here](https://docs.vllm.ai/en/latest/models/performance.html); pick the batch token count that works best (a small measurement sketch follows this list), and use the same procedure to learn how your own system behaves.
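One hedged way to run that sweep is to rebuild the engine with increasing `max_num_batched_tokens` values and time a fixed synthetic workload. The sketch below assumes the offline `LLM` API; the prompt set and the 64/128 sizes are arbitrary placeholders, and in practice you may prefer to restart the process between runs so GPU memory is fully released.

```python
# Rough sweep sketch: time a fixed batch of prompts for several
# max-num-batched-tokens values and keep the fastest setting.
import gc
import time
from vllm import LLM, SamplingParams

MODEL = "MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ"  # model from this post's example
MAX_LEN = 4096
prompts = [f"Write a short product description for item {i}." for i in range(64)]
# Fixed output length (ignore_eos) so every run generates the same token count.
params = SamplingParams(max_tokens=128, ignore_eos=True)

for batched_tokens in (MAX_LEN, MAX_LEN * 2, MAX_LEN * 3):
    llm = LLM(model=MODEL, max_model_len=MAX_LEN, dtype="half",
              max_num_batched_tokens=batched_tokens)
    start = time.perf_counter()
    llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    print(f"max-num-batched-tokens={batched_tokens}: "
          f"{len(prompts) * 128 / elapsed:.0f} output tokens/s")
    del llm          # restarting the process between runs frees GPU memory
    gc.collect()     # more reliably than relying on garbage collection
```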
## enable-prefix-caching:

**TL;DR** Enable it unless you are using enable-chunked-prefill

* It caches computed KV values for reuse, which can save considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), you will see a much larger speed-up than when only a small portion of the request's tokens is shared.
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post's publication).
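A hedged sketch of the case where prefix caching pays off most: many requests that share a long, identical instruction prefix. The `enable_prefix_caching` keyword is assumed to mirror the CLI flag, and the prompt contents are placeholders.

```python
# Sketch: a long shared instruction prefix followed by short, varying inputs.
# With prefix caching, the KV values of the shared prefix are computed once
# and reused across requests instead of being recomputed every time.
from vllm import LLM, SamplingParams

SHARED_PREFIX = (
    "You are a customer-support assistant. Follow the policy below.\n"
    "Policy: ... (a long, unchanging block of instructions) ...\n\n"
)
tickets = ["My order arrived damaged.", "How do I reset my password?"]

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    max_model_len=4096,
    dtype="half",
    enable_prefix_caching=True,   # skip this when enable_chunked_prefill is used
)

outputs = llm.generate([SHARED_PREFIX + t for t in tickets],
                       SamplingParams(max_tokens=128))
for o in outputs:
    print(o.outputs[0].text)
```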
## gpu-memory-utilization:

**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9

* The more GPU memory is set aside for the KV cache, the better the performance.
* Reduce this value if you run into memory shortage errors, or if the GPU also has to serve another task and cannot spare that much memory.
## enforce-eager:

**TL;DR** Enable this if memory is insufficient

* When enabled, no CUDA graph is built, which may cost some performance but saves the memory the graph would otherwise occupy.
* In some configurations the CUDA graph makes little difference to performance, and the freed memory can be spent on the parameters above, such as gpu-memory-utilization, max-num-batched-tokens, etc.
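A hedged sketch combining the two memory-related knobs above: give the KV cache as much GPU memory as the card allows, and fall back to eager mode when memory is tight. The threshold values and the helper function are illustrative, not a recommendation.

```python
# Sketch: pick a memory profile. A generous KV-cache budget with CUDA graphs,
# or a tighter budget with eager execution when memory is scarce.
from vllm import LLM

def build_engine(low_memory: bool) -> LLM:
    """Build the engine with either a generous or a memory-saving profile."""
    return LLM(
        model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
        max_model_len=4096,
        dtype="half",
        # Fraction of GPU memory handed to vLLM for weights + KV cache.
        gpu_memory_utilization=0.85 if low_memory else 0.95,
        # Skip CUDA-graph capture when memory is tight; possibly slower,
        # but frees the memory the graphs would occupy.
        enforce_eager=low_memory,
    )

llm = build_engine(low_memory=False)  # retry with True if you hit OOM errors
```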
## enable-chunked-prefill:

**TL;DR** Try enabling this feature with a batched-token budget starting at 128 and increasing gradually. Compare against the configuration without this parameter. See details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, because it has to read the previously computed KV values.
* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).
* As mentioned above, the vLLM scheduler favors prefill by default, and prefill is compute-bound; during decoding, memory becomes the bottleneck instead, so the decode phase, with its low compute demand and high memory demand (KV values must be read), is left underutilized.
* Chunked prefill splits the prefill stage into chunks, which allows a much smaller batch token budget and lets a batch be assembled from decoding requests plus chunked prefill requests. This uses the GPU efficiently: compute-intensive prefill chunks are combined with enough decoding requests, which are light on compute but heavy on memory.
* Enable the feature, start with a small maximum batch token count such as 128, and increase it gradually until you reach the throughput and latency you need (a configuration sketch follows this list).
* This approach can reduce per-token latency, since decoding gets priority and the small batch keeps memory pressure low, but it can hurt throughput, so experiment with different batch sizes and compare the results against runs without this parameter.
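A hedged sketch of the suggested starting point for chunked prefill: enable it with a deliberately small token budget and grow it while re-measuring (prefix caching is left off here, per the note above). The keyword arguments mirror the CLI flags and may vary between vLLM versions.

```python
# Sketch: chunked prefill with a small token budget, so decode requests can
# share each batch with prefill chunks instead of waiting behind full prefills.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    max_model_len=4096,
    dtype="half",
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,   # start here; try 256, 512, ... and re-measure
)

outputs = llm.generate(["Explain chunked prefill in two sentences."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```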
## Example:

* We will use vLLM's [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following command:
## enforce-eager:

**TL;DR** Enable this if memory is insufficient

* When enabled, no CUDA graphs are built, which may cost some performance but saves the memory the graphs would otherwise occupy.
* Under some configurations the CUDA graphs barely affect performance, and the memory saved can instead be used to raise the parameters above, such as gpu-memory-utilization and max-num-batched-tokens.

## enable-chunked-prefill:

**TL;DR** Try enabling this feature with a token budget starting at 128 and gradually increasing it. Compare against the configuration without this parameter.
See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory bandwidth, since the previously computed KV values must be read back.
* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).
* As mentioned earlier, the vLLM scheduler favors prefill by default, which is compute-bound, while during decoding memory becomes the bottleneck; scheduling the two phases separately therefore leaves either compute or memory bandwidth underused.
* Chunked prefill splits the prefill stage into chunks, so the token budget can be reduced and a batch can be built from decoding requests plus a chunk of a prefill request. Mixing the compute-heavy prefill chunk with decode requests, which need little compute but a lot of memory bandwidth, uses the GPU more efficiently.
* Enable this feature with a small token budget, such as 128, and gradually increase it until you reach the desired throughput and latency; an example launch is sketched below.
* This approach usually lowers inter-token latency, because decoding takes priority and the small batch keeps memory pressure down, but it can hurt throughput, so experiment with several token budgets and compare against runs without this parameter.
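As a starting point, a minimal launch sketch with chunked prefill enabled; the model name and the other flags simply mirror the example below, and the 128-token budget is only the suggested starting value, not a tuned setting (prefix caching is omitted because it has no effect with chunked prefill):

```
python3 -m vllm.entrypoints.api_server \
  --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
  --gpu-memory-utilization 0.95 \
  --port 5000 \
  --dtype half \
  --enforce-eager \
  --max-model-len 4096 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 128
```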
## Example:

* We will use vLLM's [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching
```
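Once the server is up, you can query it over plain HTTP. The sketch below assumes the `/generate` route with `prompt`, `max_tokens`, and `temperature` fields that this demo api_server accepted at the time of writing; double-check the request schema against your vLLM version, and note that the prompts themselves are arbitrary:

```
# Two requests that share a long identical prefix; with --enable-prefix-caching
# the second request can reuse the cached KV blocks of the shared prefix.
PREFIX="You are a helpful assistant. Answer briefly and precisely."
curl -s http://localhost:5000/generate \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"${PREFIX} Question: What is PagedAttention?\", \"max_tokens\": 128, \"temperature\": 0}"
curl -s http://localhost:5000/generate \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"${PREFIX} Question: What does max-model-len control?\", \"max_tokens\": 128, \"temperature\": 0}"
```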
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunk{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked pre{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
## gpu-memory-utilization:

**TL;DR** The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory can be allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU is also running another workload that needs part of the memory (see the example below).
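If the GPU is shared with another workload, one way to pick a value is to check what is already in use and leave headroom for it; a sketch, where 0.8 is only an illustrative number:

```
# See how much GPU memory other processes already occupy
nvidia-smi --query-gpu=memory.total,memory.used --format=csv
# Leave headroom for them when launching vLLM (0.8 is illustrative, not a recommendation)
python3 -m vllm.entrypoints.api_server --model <your-model> --port 5000 --gpu-memory-utilization 0.8
```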
## enforce-eager:

**TL;DR** Enable this if memory is insufficient

* When enabled, no CUDA graph is built, which may cost some performance but saves the memory the graph would otherwise require.
* In some configurations the CUDA graph barely affects performance, and the saved memory can instead be used to raise the parameters discussed above, such as gpu-memory-utilization and max-num-batched-tokens.
## enable-chunked-prefill:

**TL;DR** Try enabling this feature with a token budget (max-num-batched-tokens) starting at 128 and gradually increasing. Compare against the configuration without this parameter. See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory bandwidth, because the previously computed KV values must be read.
* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).
* As mentioned above, the vLLM scheduler favors prefill by default, and prefill is compute-hungry; during decoding, memory becomes the bottleneck because KV values must be read, so decoding, with its low compute and high memory demands, cannot keep the GPU fully utilized on its own.
* Chunked prefill splits the prefill stage into chunks. This allows a much smaller token budget, so a batch can be formed by first collecting decoding requests and then adding chunked prefill requests, using the GPU efficiently by combining compute-intensive prefill chunks with enough decoding requests, which need little compute but a lot of memory bandwidth.
* Try enabling this feature with a small maximum token budget, such as 128, and gradually increase it until you reach the throughput and latency you need.
* This approach can reduce inter-token latency, because decoding gets priority and the small batch reduces memory pressure, but it may hurt throughput, so experiment with different batch sizes and compare the results against a configuration without this parameter.
## Example:

* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching
```

* If chunked prefill is enabled, the command may look like this:
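A plausible variant, assuming the guidance above: chunked prefill enabled, a small token budget such as 128, and prefix caching dropped because it has no effect together with chunked prefill. This is an assumption, not necessarily the author's original command:

```
# Assumed variant, not the original command: chunked prefill with a small token budget;
# --enable-prefix-caching is omitted because it has no effect alongside chunked prefill.
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ \
  --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager \
  --max-model-len 4096 --max-num-batched-tokens 128 --enable-chunked-prefill
```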
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m v{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vll{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entry{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## Example:

* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching
```

* If chunked prefill is enabled, the command may look like this:
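A reasonable starting point, assuming the same model as above, the small initial token budget recommended in this post, and prefix caching dropped because it does not combine with chunked prefill here; the values are placeholders to tune, not definitive settings:

```
# Illustrative values; the post suggests starting near 128 batched tokens and increasing while you measure.
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --enable-chunked-prefill --max-num-batched-tokens 512
```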
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model Maz{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model Maziy{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model Maziyar{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPan{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. 
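Once either server is running, you can exercise it through the demo /generate endpoint. The snippet below is a minimal client sketch, assuming the server above is listening locally on port 5000; the prompt and sampling values are placeholders. The endpoint accepts a JSON body with a prompt plus SamplingParams fields and returns the generated text under a "text" key:
```
import requests

# Assumes the api_server from the commands above is listening on port 5000.
response = requests.post(
    "http://localhost:5000/generate",
    json={
        "prompt": "Explain in one sentence what the KV cache stores.",
        "max_tokens": 128,   # output budget; prompt + output must fit within max-model-len
        "temperature": 0.0,
    },
    timeout=120,
)
response.raise_for_status()
# The demo server returns {"text": [...]}; it may echo the prompt at the start of each entry.
print(response.json()["text"][0])
```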
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-In{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-G{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPT{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-util{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization {"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.9{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
## Example:

* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:
```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching
```
* If chunked prefill is enabled, the command may look like this:
```
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 128 --enable-chunked-prefill
```
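Once the server is running, requests go to its `/generate` route. The sketch below uses the `requests` library and assumes the port from the commands above; the demo server responds with a JSON object whose `text` field holds the generated completions (prefixed with the prompt):

```
# Sketch: query the demo API server started by the commands above.
import requests

resp = requests.post(
    "http://localhost:5000/generate",
    json={
        "prompt": "What is the capital of France?",
        "max_tokens": 64,      # forwarded to the sampling parameters
        "temperature": 0.0,
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
# Response shape: {"text": ["<prompt + completion>", ...]}
print(resp.json()["text"][0])
```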
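To compare configurations (with and without chunked prefill, and across different max-num-batched-tokens values), a rough benchmark along the following lines can help. This is a hypothetical sketch, not from the original post: it fires a fixed number of concurrent requests at the running server and reports mean request latency and requests per second.

```
# Sketch: rough load test against the running server; run once per configuration.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:5000/generate"
PROMPT = "Explain the difference between prefill and decoding in one paragraph."
N_REQUESTS = 32
CONCURRENCY = 8

def one_request(_):
    start = time.perf_counter()
    r = requests.post(
        URL,
        json={"prompt": PROMPT, "max_tokens": 128, "temperature": 0.0},
        timeout=300,
    )
    r.raise_for_status()
    return time.perf_counter() - start

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(N_REQUESTS)))
wall = time.perf_counter() - t0

print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")
print(f"throughput:   {N_REQUESTS / wall:.2f} req/s")
```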
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 50{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 500{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. 
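If you would rather compare parameter settings without standing up a server, the same knobs can be passed to the offline LLM class, whose keyword arguments mirror the CLI flags discussed above. This is only a sketch of one way to time a batch under a given configuration, not the author's benchmark; rerun it with different max_num_batched_tokens values (and with chunked prefill off) to see how throughput changes on your hardware.

```
import time
from vllm import LLM, SamplingParams

# Engine arguments mirror the CLI flags discussed above.
llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    gpu_memory_utilization=0.95,
    max_model_len=4096,
    enforce_eager=True,
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # start small, then re-run with larger budgets
)

prompts = ["Summarize the benefits of chunked prefill."] * 32
params = SamplingParams(max_tokens=64, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Rough throughput estimate for this configuration.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```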
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 409{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 —{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-b{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). 
Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. 
See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-t{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-p{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-pref{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n{"output": "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\n**TL;DR** Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\n**TL;DR** Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or [eviction warning](https://docs.vllm.ai/en/latest/models/performance.html). Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified [here](https://docs.vllm.ai/en/latest/models/performance.html), to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\n**TL;DR** Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\n**TL;DR** The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\n**TL;DR** Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\n**TL;DR** Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to [this article](https://arxiv.org/pdf/2308.16369) and [this vLLM page](https://docs.vllm.ai/en/latest/models/performance.html).\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s [API entrypoints](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py) instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n```\n* If chunked prefill is enabled, the command may look like this:\n```\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
## Example:

* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching
* If chunked prefill is enabled, the command may look like this:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

* Send a request to the server’s endpoint mentioned above (a hedged example request is sketched below).
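The exact request format depends on your vLLM version. The sketch below assumes the demo `api_server` started above, which at the time of writing exposed a `POST /generate` route taking a JSON body with a `prompt` plus sampling parameters and returning the generated text under a `text` key; treat the route and field names as assumptions and check them against your installed version.

```python
# Hedged sketch: query the demo api_server started above. The /generate route,
# the request fields, and the "text" key in the response are assumptions about
# the demo server shipped with vLLM at the time of writing; verify them against
# your installed version.
import requests

payload = {
    "prompt": "Explain KV caching in two sentences.",
    "max_tokens": 128,
    "temperature": 0.0,
}

resp = requests.post("http://localhost:5000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["text"])
```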
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values for reuse, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), you will see a much larger performance boost than when only a small portion of the tokens is shared (see the sketch below).
* If enable-chunked-prefill is used, this parameter has no effect (as of the time of this post’s publication).
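A rough illustration of the favorable case, using vLLM's offline Python API with an assumed model name and a made-up shared system prompt:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    max_model_len=4096,
    enable_prefix_caching=True,
)

# A long prefix shared by every request, e.g. a fixed system prompt or policy text.
shared_prefix = "You are a support assistant for ACME Corp. " + "Internal policy text. " * 100
questions = ["How do I reset my password?", "What is the refund window?"]

params = SamplingParams(max_tokens=64)
outputs = llm.generate([shared_prefix + q for q in questions], params)
# After the first request, the KV blocks of `shared_prefix` are reused from the cache.
```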
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is available for storing the KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors or if the GPU is shared with another task that needs part of the memory.

## enforce-eager:

TL;DR Enable this if memory is tight.

* When enabled, no CUDA graph is created, which may cost some performance but saves the memory the graph would otherwise require.
* Under some configurations the CUDA graph makes little difference, and the saved memory can be used to raise the parameters above, such as gpu-memory-utilization and max-num-batched-tokens. A combined sketch follows below.
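A combined sketch of the two parameters above, again with the offline Python API and an assumed model name:

```python
from vllm import LLM

# Trading CUDA-graph memory for KV-cache space: enforce_eager skips graph capture,
# which can leave room to raise gpu_memory_utilization above the 0.9 default.
llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    max_model_len=4096,
    gpu_memory_utilization=0.95,  # lower this if another process shares the GPU
    enforce_eager=True,
)
```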
## enable-chunked-prefill:

TL;DR Try enabling this feature with a token batch starting at 128 and increase it gradually. Compare against the configuration without this parameter. See the details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but more memory bandwidth, because the stored KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler favors prefill by default, which is compute-bound, while during decoding memory becomes the bottleneck; as a result, the compute-light, memory-heavy decode work cannot be kept fully utilized on its own.
* Chunked prefill splits the prefill stage into chunks, which allows a smaller token batch: the scheduler can build a batch from decoding requests plus a chunk of prefill, using the GPU efficiently by combining compute-heavy chunked prefill with enough decoding requests, which need little compute but a lot of memory bandwidth.
* Try enabling this feature with a small maximum token batch, such as 128, and increase it gradually until you reach the throughput and latency you need.
* This approach can reduce inter-token latency, since decoding gets priority thanks to the small batch (lower memory pressure), but it can hurt throughput, so experiment with different batch sizes and compare against runs without this parameter. A minimal sketch follows below.
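A minimal offline sketch with chunked prefill enabled and a small token budget; the 256 is just a starting point to tune, as described above, and the model name is an assumption carried over from the launch commands:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    max_model_len=4096,
    enable_chunked_prefill=True,
    max_num_batched_tokens=256,  # may now be smaller than max_model_len
)
outputs = llm.generate(
    ["Explain KV caching in one paragraph."],
    SamplingParams(max_tokens=128),
)
```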
## Example:

* We will use vLLM’s API entrypoint instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching

* If chunked prefill is enabled, the command may look like this:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters, for example:
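A minimal client sketch, assuming the server above is listening on port 5000. The /generate route and payload fields follow the demo api_server and may differ between vLLM versions, so adjust them to your installation.

```python
import time

import requests

# Payload fields (prompt plus sampling parameters) are assumptions based on the
# demo api_server; check your vLLM version if the request is rejected.
payload = {
    "prompt": "List three tips for tuning vLLM.",
    "max_tokens": 128,
    "temperature": 0.0,
    "stream": False,
}

start = time.time()
response = requests.post("http://localhost:5000/generate", json=payload)
print(f"latency: {time.time() - start:.2f}s")
print(response.json())
```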
* Conduct
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. 
For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your 
model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length 
of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length 
of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context 
length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n*{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches already-computed KV values so they can be reused across requests, which saves considerable time.
* If most of the content of your requests stays the same from call to call (for example, a long shared system prompt with a small varying suffix), you will see a much bigger performance boost than when only a small portion of the tokens is repeated.
* If enable-chunked-prefill is used, this parameter is not supported (as of the time of this post’s publication).
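The effect is easiest to see with a long shared prefix. In the sketch below (an illustration, not taken from the post), the first request populates the cache, and with enable_prefix_caching=True the second request can reuse the KV blocks of the shared system prompt, so its prefill is noticeably cheaper:

```python
import time
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = "You are a careful assistant. Follow the style guide below.\n" * 100  # long shared prefix (placeholder)

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    max_model_len=8192,
    enable_prefix_caching=True,  # reuse KV cache blocks of identical prefixes
    dtype="half",
)
params = SamplingParams(max_tokens=64, temperature=0.0)

llm.generate([SYSTEM_PROMPT + "Question 1: What is PagedAttention?"], params)   # warms the prefix cache

start = time.perf_counter()
llm.generate([SYSTEM_PROMPT + "Question 2: What is chunked prefill?"], params)  # prefix KV values are reused
print(f"second request took {time.perf_counter() - start:.2f}s")
```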
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. Default is 0.9

* The more GPU memory is allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU is also running another task that needs part of the memory.
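If you are unsure how much headroom the card has before raising this value (for example when another process shares the GPU), you can check free memory directly. This is an illustrative snippet using PyTorch, which vLLM already depends on; the 5% safety margin is an arbitrary assumption:

```python
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # free/total memory on the current CUDA device
print(f"free: {free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB")

# Leave whatever other processes already use, plus a small safety margin,
# and treat the rest as an upper bound for --gpu-memory-utilization.
upper_bound = max(0.0, free_bytes / total_bytes - 0.05)
print(f"rough upper bound for --gpu-memory-utilization: {upper_bound:.2f}")
```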
## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graph will be created, which may affect performance but will save the memory that would otherwise be required for the graph.
* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.
## enable-chunked-prefill:

TL;DR Try enabling this feature with a small batched-token count, starting around 128 and gradually increasing. Compare against the configuration without this parameter. See details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix multiplications instead of matrix-matrix multiplications), but more demanding on memory, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As previously mentioned, the vLLM scheduler by default favors prefill, which is compute-bound. During decoding, memory becomes the bottleneck instead, so scheduling the two stages separately leaves the GPU underutilized during decode, where compute demand is low but memory traffic is high.
* Chunked prefill splits the prefill stage into chunks. This allows a much smaller batch token count and lets each batch be formed from pending decode requests plus a chunk of prefill work, so the GPU is used efficiently by pairing compute-intensive prefill chunks with decode requests that need little compute but a lot of memory bandwidth.
* Try enabling this feature with a small maximum batch token count, such as 128, and gradually increase it until you reach the throughput and latency you need.
* This method may reduce per-token latency, as decoding takes priority and the small batch size keeps memory pressure down, but it may hurt throughput, so experiment with different batch sizes and compare the results against runs without this parameter.
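As an offline-API counterpart to the second command in the Example section below (a sketch for illustration, not the author's code; keyword names can differ between vLLM versions), chunked prefill is enabled together with a small token batch like this:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    max_model_len=4096,
    enable_chunked_prefill=True,  # mix prefill chunks with decode requests in each batch
    max_num_batched_tokens=256,   # start small (128-256) and raise it while measuring latency and throughput
    gpu_memory_utilization=0.95,
    enforce_eager=True,
    dtype="half",
)

params = SamplingParams(max_tokens=128, temperature=0.0)
outputs = llm.generate(["Describe chunked prefill in one paragraph."], params)
print(outputs[0].outputs[0].text)
```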
## Example:

* We will use vLLM’s demo API entrypoint (vllm.entrypoints.api_server) instead of the OpenAI-compatible server.
* To run the server without chunked prefill, use the following command:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching

* If chunked prefill is enabled, the command may look like this:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

* Send requests to the server’s endpoint to evaluate performance under the different parameter settings (see the client sketch below).
* Conduct benchmarking using vLLM’s provided benchmark scripts to compare the different configurations.
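For a quick manual check against either server configuration, a minimal client sketch could look like the following; the demo api_server exposes a /generate endpoint, though the exact request and response fields may differ between vLLM versions:

```python
import time
import requests

# Sampling fields such as max_tokens and temperature are forwarded to vLLM's SamplingParams.
payload = {
    "prompt": "Explain the difference between prefill and decode in one paragraph.",
    "max_tokens": 128,
    "temperature": 0.0,
    "stream": False,
}

start = time.perf_counter()
resp = requests.post("http://localhost:5000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(f"latency: {time.perf_counter() - start:.2f}s")
print(resp.json())  # contains the generated text; compare latency across the two configurations
```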
maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length 
equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length 
equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [v{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length 
equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLL{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length 
equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model 
length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill

* It caches computed KV values so they can be reused across requests, which can save considerable time.
* If most of a request is identical across inputs (for example, a long shared system prompt or few-shot preamble), you will see a much larger performance boost than when only a small portion of the tokens is shared.
* Prefix caching cannot be combined with enable-chunked-prefill (as of the time of this post’s publication). A small sketch of the shared-prefix case follows this list.
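To see the effect, one can send two requests that share a long prefix; this is a minimal sketch under assumed prompts and lengths, not a benchmark from the post:

```python
import time

from vllm import LLM, SamplingParams

# Hypothetical long shared prefix (e.g. a system prompt reused by every request).
shared_prefix = "You are a support assistant for ACME Corp. " * 100

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    max_model_len=4096,
    enable_prefix_caching=True,
)
params = SamplingParams(max_tokens=64, temperature=0.0)

for question in ("How do I reset my password?", "How do I close my account?"):
    start = time.perf_counter()
    llm.generate([shared_prefix + question], params)
    print(f"{question!r}: {time.perf_counter() - start:.2f}s")
    # The second request should be noticeably faster: the KV cache blocks
    # for the shared prefix are reused instead of being recomputed.
```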
## gpu-memory-utilization:

TL;DR The higher, the better, unless memory shortage occurs. Default is 0.9

* The more GPU memory is made available for the model and its KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors, or if the GPU is shared with another task that needs part of the memory.

## enforce-eager:

TL;DR Enable this if memory is insufficient

* When enabled, no CUDA graph is built, which may cost some performance but saves the memory the graph would otherwise occupy.
* Under some configurations the CUDA graph makes little difference to performance, and the freed memory can be spent on increasing the parameters above, such as gpu-memory-utilization and max-num-batched-tokens. The sketch below combines both flags.
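As a small configuration sketch mirroring the flags used in the server commands later in this post, the same two knobs look like this in the offline API (the trade-off shown in the comments is an assumption to verify on your own workload):

```python
from vllm import LLM

# Assumed trade-off: give up CUDA graphs (enforce_eager=True) and spend the
# freed memory on a larger KV cache via a higher gpu_memory_utilization.
llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    max_model_len=4096,
    gpu_memory_utilization=0.95,  # default is 0.9; lower it if OOM persists
    enforce_eager=True,           # skip CUDA graph capture to save memory
)
```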
## enable-chunked-prefill:

TL;DR Try enabling this feature with max-num-batched-tokens starting at 128 and increasing gradually. Compare against the configuration without this parameter. See details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplications) but is more demanding on memory bandwidth, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler favors prefill by default, and prefill is compute-bound; decoding, by contrast, is memory-bound, so on its own it cannot keep the GPU’s compute units busy while it reads KV values.
* Chunked prefill splits the prefill stage into chunks, which allows a much smaller token budget per batch: the scheduler first collects decoding requests and then tops the batch up with prefill chunks, so compute-heavy prefill and memory-heavy, compute-light decoding share the GPU efficiently.
* Try enabling this feature with a small token budget, such as 128, and increase it gradually until you reach the throughput and latency you need.
* This usually reduces inter-token latency, because decoding gets priority and the small batch keeps memory pressure low, but it can hurt throughput, so experiment with different budgets and compare against runs without this parameter. A configuration sketch follows this list.
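In the offline API, the equivalent of the chunked-prefill server command looks roughly like this; the starting budget of 128 is the same assumption as in the TL;DR:

```python
from vllm import LLM

# Assumed starting point for chunked prefill: a small per-batch token budget
# that is raised step by step while throughput and latency are measured.
llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    max_model_len=4096,
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,  # try 128, 256, 512, ... and compare
)
```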
## Example:

* We will use vLLM’s API server entrypoint (vllm.entrypoints.api_server) instead of the OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching

* If chunked prefill is enabled, the command may look like this:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

* Send requests to the server’s endpoint and compare performance under the different parameter settings; a sketch of such a request follows below.
* Conduct benchmarking using vLLM’s provided scripts (see the benchmarks directory of the vLLM GitHub repository: https://github.com/vllm-project/vllm).
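A minimal client sketch, assuming the demo /generate route of vllm.entrypoints.api_server and the port used above; the prompt and sampling values are placeholders, and the request schema may differ across vLLM versions:

```python
import time

import requests

# Hypothetical client for the demo server started above (port 5000).
payload = {
    "prompt": "Explain the difference between prefill and decoding in one paragraph.",
    "max_tokens": 128,
    "temperature": 0.0,
    "stream": False,
}

start = time.perf_counter()
response = requests.post("http://localhost:5000/generate", json=payload, timeout=120)
response.raise_for_status()
elapsed = time.perf_counter() - start

print(f"latency: {elapsed:.2f}s")
print(response.json())  # the demo server returns the generated text as JSON
```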
maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts]({“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, 
the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By 
default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By 
default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://github{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* 
By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://github.com{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](v (Vaibhav Verma) · GitHub{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](Vll (Vll) · GitHub{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values so they can be reused across requests, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared instruction or system prompt), the speedup is much larger than when only a small portion of the tokens is shared (see the sketch after this list).
* If enable-chunked-prefill is set, this parameter has no effect (as of the time of this post's publication).
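As an illustration, the sketch below assumes a workload in which every request starts with the same long instruction, which is the case where prefix caching helps most; the prompt text and questions are placeholders.

```python
from vllm import LLM, SamplingParams

# A long instruction shared by every request: the KV blocks of this prefix are
# computed once and then served from the cache for subsequent requests.
SHARED_PREFIX = (
    "You are a support assistant for the ACME appliance manual. Answer politely, "
    "quote the relevant section of the manual, and keep replies under 100 words.\n\n"
)

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    max_model_len=4096,
    enable_prefix_caching=True,   # reuse cached KV blocks of the shared prefix
    dtype="half",
)

questions = ["How do I reset the device?", "Where can I download invoices?"]
params = SamplingParams(temperature=0.0, max_tokens=128)

for out in llm.generate([SHARED_PREFIX + q for q in questions], params):
    print(out.outputs[0].text)
```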
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run out of memory. The default is 0.9.

* The more GPU memory is allocated for the KV cache, the better the performance.
* Reduce this value if you encounter out-of-memory errors, or if the GPU also has to run another workload that needs the memory.

## enforce-eager:

TL;DR Enable this if memory is tight.

* When enabled, no CUDA graphs are captured. This may cost some performance, but it saves the memory the graphs would otherwise require.
* In some configurations the CUDA graphs make little difference, and the freed memory can be spent on the parameters discussed above, such as gpu-memory-utilization or max-num-batched-tokens.
## enable-chunked-prefill:

TL;DR Try enabling this feature, start with a token batch of 128, and increase it gradually. Compare against the configuration without this flag. Details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplication) but is harder on memory, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As mentioned above, the vLLM scheduler favors prefill by default, which is compute-bound, while during decoding memory becomes the bottleneck; decoding requests, which need little compute but a lot of memory bandwidth, therefore cannot be fully utilized.
* Chunked prefill splits the prefill stage into chunks. This allows a smaller token batch and lets the scheduler build a batch by first collecting decoding requests and then adding chunked prefill requests, so the GPU is used efficiently: compute-heavy prefill chunks are combined with enough decode requests, which are light on compute but heavy on memory.
* Enable the feature with a small maximum token batch, such as 128, and increase it gradually until you reach the throughput and latency you need (see the sketch after this list).
* This approach can reduce per-token latency, because decoding takes priority and the small batch reduces memory pressure, but it can hurt throughput, so experiment with different batch sizes and compare the results against runs without this flag.
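A minimal sketch of this configuration with assumed values might look as follows; the keyword arguments mirror the command-line flags shown in the example at the end of the post.

```python
from vllm import LLM, SamplingParams

# Chunked prefill with a small token batch: the scheduler can now mix prefill
# chunks with waiting decode requests instead of running prefill exclusively.
llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    max_model_len=4096,
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,   # start small, then increase and re-measure
    gpu_memory_utilization=0.95,
    dtype="half",
    enforce_eager=True,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Explain chunked prefill in two sentences."], params)
print(outputs[0].outputs[0].text)
```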
## Example:

* We will use vLLM's API entrypoint instead of the OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching

* With chunked prefill enabled, the command may look like this:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

* Send requests to the server's endpoint to evaluate performance under the different parameter settings (a small client sketch follows this list).
* Run benchmarks using [vLLM's provided scripts](https://github.com/vllm-project/vllm/tree/main/benchmarks).
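A minimal client sketch is shown below. It assumes the demo server started above is listening on port 5000 and that the simple api_server exposes a /generate route accepting a JSON body with the prompt and sampling parameters; adjust the fields if your vLLM version differs.

```python
import requests

# Query the demo server started above. The /generate route and its JSON fields
# follow vLLM's simple api_server entrypoint; adjust them if your version differs.
payload = {
    "prompt": "Explain the difference between prefill and decoding.",
    "max_tokens": 128,
    "temperature": 0.0,
    "stream": False,
}

resp = requests.post("http://localhost:5000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # the generated text is returned under the "text" key

# Wrap this call in a timing loop, or use vLLM's benchmark scripts, to compare
# latency and throughput across the parameter settings discussed above.
```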
By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://github.com/vllm-project/v{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used 
(input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://github.com/vllm-project/vll{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used 
(input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## 
max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://github.com/vllm-project/vllm/tree{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens 
used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://github.com/vllm-project/vllm/tree/main{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of 
tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://github.com/vllm-project/vllm/tree/main/b{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of 
tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using [vLLM’s provided scripts](https://github.com/vllm-project/vllm/tree/main/bench{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number 
## enable-chunked-prefill:

TL;DR Try enabling this feature, starting with a batched-token count of 128 and increasing it gradually. Compare against the configuration without this parameter. See details below.

* The decoding stage is less demanding on compute than prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, because the previously computed KV values must be read back.
* For further reading, refer to this article and this vLLM page.
* As previously mentioned, the vLLM scheduler by default favors prefill, which is compute-bound. During decoding, memory becomes the bottleneck instead, so the GPU cannot be fully utilized: decoding needs little compute but a lot of memory bandwidth to read the KV values.
* Chunked prefill splits the prefill stage into chunks. This allows a smaller batched-token count and lets a batch be formed by first collecting decoding requests and then filling the remaining budget with prefill chunks, so the GPU is used efficiently: compute-heavy prefill chunks run alongside decoding requests that are light on compute but heavy on memory.
* Try enabling this feature with a small maximum batched-token count, such as 128, and increase it gradually until you reach the desired throughput and latency; a small sweep sketch is shown after this list.
* This method can reduce inter-token latency, because decoding takes priority and the small batch keeps memory pressure low, but it may hurt throughput, so experiment with different batch sizes and compare the results against runs without this parameter.
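As a rough illustration of that tuning loop (not vLLM’s own benchmark suite), the sketch below measures offline generation throughput for a few `max_num_batched_tokens` values with chunked prefill enabled. The model, prompts, and candidate values are placeholders, and in practice it is safer to run each configuration in a fresh process so GPU memory is fully released between runs:

```python
# Hypothetical tuning sweep: compare offline throughput for several
# max_num_batched_tokens settings with chunked prefill enabled.
# Running one setting per process is safer; this loop is only illustrative.
import time
from vllm import LLM, SamplingParams

prompts = ["Summarize the benefits of chunked prefill in two sentences."] * 64
params = SamplingParams(max_tokens=128, temperature=0.0)

for max_batched in (128, 256, 512, 1024):
    llm = LLM(
        model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",  # placeholder model
        dtype="half",
        max_model_len=4096,
        enable_chunked_prefill=True,
        max_num_batched_tokens=max_batched,
    )
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"max_num_batched_tokens={max_batched}: {generated / elapsed:.1f} tok/s")
    del llm  # may not fully release GPU memory; prefer separate processes
```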
## Example:

* We will use vLLM’s demo API entrypoint (vllm.entrypoints.api_server) instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following example:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching

* If chunked prefill is enabled, the command may look like this:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

* Send a few requests to the server’s endpoint to evaluate performance under the different parameters; a minimal client sketch is shown below.
* Conduct benchmarking using [vLLM’s provided scripts](https://github.com/vllm-project/vllm/tree/main/benchmarks), and I will also attempt to write a …
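For the request itself, a small Python client along the following lines is enough to compare latency across the two server configurations. It assumes the demo `/generate` route exposed by vllm.entrypoints.api_server, which takes a prompt plus sampling parameters as JSON; adjust the payload if your vLLM version differs:

```python
# Minimal latency probe against the demo server started above (port 5000).
# Assumes the /generate route of vllm.entrypoints.api_server; fields other
# than "prompt" are treated as sampling parameters.
import time
import requests

payload = {
    "prompt": "List three ways to reduce LLM serving latency.",
    "max_tokens": 128,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post("http://localhost:5000/generate", json=payload, timeout=120)
resp.raise_for_status()
elapsed = time.perf_counter() - start

print(f"end-to-end latency: {elapsed:.2f}s")
print(resp.json()["text"][0])  # the demo server returns {"text": [...]}
```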
number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts,{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, the 
maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By default, 
the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By 
default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* By 
default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + output)\n\n* 
By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used (input + 
output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to write{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of tokens used 
## enable-prefix-caching:

TL;DR Enable it unless you are using enable-chunked-prefill.

* It caches computed KV values for reuse, which saves considerable time.
* If most of each request is identical across inputs (for example, a long shared system prompt), you will see a much larger performance boost than when only a small portion of the request's tokens is shared; see the sketch below.
* If the enable-chunked-prefill parameter is used, this parameter cannot be used (as of the time of this post's publication).
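A workload shaped to benefit from prefix caching might look like the following sketch (offline LLM API; the system prompt and questions are placeholders): every request repeats the same long prefix, so its KV values are computed once and reused.

```python
# All prompts share a long, identical prefix, which prefix caching computes only once.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    max_model_len=4096,
    enable_prefix_caching=True,
)

shared_prefix = (
    "You are a support assistant for ACME Corp. Follow these policies:\n" + "- policy line\n" * 200
)
questions = [
    "How do I reset my password?",
    "What is the refund window?",
    "Where do I report a bug?",
]

outputs = llm.generate([shared_prefix + q for q in questions], SamplingParams(max_tokens=128))
for out in outputs:
    print(out.outputs[0].text.strip()[:80])
```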
## gpu-memory-utilization:

TL;DR The higher, the better, unless you run into memory shortages. The default is 0.9.

* The more GPU memory is allocated for storing the KV cache, the better the performance.
* Reduce this value if you encounter memory shortage errors, or if the GPU is also running another task that needs part of the memory; a small helper for sizing it on a shared GPU is sketched below.
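When the GPU is shared, one way to pick a value is to look at how much memory is actually free and leave a margin. The sketch below uses torch.cuda.mem_get_info; the 0.05 safety margin is an assumption on my part, not a vLLM recommendation.

```python
# Suggest a gpu-memory-utilization value from the currently free GPU memory.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) for the current device
safe_fraction = max(0.0, free_bytes / total_bytes - 0.05)  # keep a 5% safety margin (assumption)
print(f"{free_bytes / 2**30:.1f} GiB free of {total_bytes / 2**30:.1f} GiB "
      f"-> try --gpu-memory-utilization {safe_fraction:.2f}")
```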
## enforce-eager:

TL;DR Enable this if memory is insufficient.

* When enabled, no CUDA graphs are created, which may affect performance but saves the memory the graphs would otherwise require.
* Under certain configurations the CUDA graphs do not make a significant difference, and the saved memory can be spent on increasing the parameters discussed above, such as gpu-memory-utilization or max-num-batched-tokens; a memory-lean configuration along these lines is sketched below.
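As an illustration, here is a memory-lean configuration using the offline LLM API (the keyword argument names are assumed to mirror the CLI flags): eager mode gives up CUDA graphs, and the freed memory is spent on a high gpu_memory_utilization and a larger token batch.

```python
from vllm import LLM

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    enforce_eager=True,            # skip CUDA graph capture to save memory
    gpu_memory_utilization=0.95,   # hand almost all GPU memory to weights + KV cache
    max_model_len=4096,
    max_num_batched_tokens=8192,
    enable_prefix_caching=True,
)
```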
## enable-chunked-prefill:

TL;DR Try enabling this feature with a token batch starting at 128 and increase it gradually. Compare against the configuration without this parameter. See the details below.

* The decoding stage needs less compute than prefill (vector-matrix instead of matrix-matrix multiplications) but is more demanding on memory, as it must read the previously computed KV values.
* For further reading, refer to this article and this vLLM page.
* As previously mentioned, the vLLM scheduler favors prefill by default, which is compute-bound, while decoding is bottlenecked by memory because KV values must be read back; on their own, the compute-light, memory-heavy decode steps cannot keep the GPU fully utilized.
* Chunked prefill divides the prefill stage into chunks. This allows a much smaller token batch and lets the scheduler form a batch by first collecting decode requests and then adding prefill chunks, so compute-intensive prefill work runs alongside enough compute-light, memory-heavy decode requests and the GPU is used efficiently.
* Enable the feature with a small maximum token batch, such as 128, and increase it gradually until you reach the throughput and latency you need.
* This approach can reduce per-token latency, because decoding gets priority and the small batch reduces memory pressure, but it can hurt throughput, so experiment with different batch sizes and compare the results against runs without this parameter. A starting configuration is sketched below.
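A starting configuration for that experiment, again using the offline LLM API with assumed keyword argument names, might look like this; increase max_num_batched_tokens step by step and compare against the non-chunked setup above.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
    dtype="half",
    max_model_len=4096,
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,   # start small; decode requests fill most of each batch
)

outputs = llm.generate(
    ["Explain chunked prefill in two sentences."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```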
## Example:

* We will use vLLM's API entrypoints instead of an OpenAI-compatible server.
* To run the server without chunked prefill, use the following command:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching

* If chunked prefill is enabled, the command may look like this:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

* Send requests to the server's endpoint to evaluate performance under the different parameters; a minimal client is sketched below.
* Conduct benchmarking using vLLM's provided scripts; I will also attempt to write a similar script and share it here.
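A minimal client for the server started above might look like the sketch below. The /generate route and the JSON fields ("prompt" plus sampling parameters such as "max_tokens" and "temperature") are assumed from vLLM's demo API server; adjust them if your vLLM version exposes a different schema.

```python
import requests

resp = requests.post(
    "http://localhost:5000/generate",
    json={
        "prompt": "Explain what max-num-batched-tokens controls, in one paragraph.",
        "max_tokens": 200,
        "temperature": 0.0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```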
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to write a similar script{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number of 
tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to write a similar script and{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum number 
of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to write a similar script and share{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the maximum 
number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to write a similar script and share it{“output”: "The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the 
maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to write a similar script and share it here{“output”: “The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to the 
maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to write a similar script and share it here.”{“output”: “The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to 
the maximum number of tokens used (input + output)\n\n* By default, the maximum model length equals the maximum context length of your model. For example, for the llama 3 8B instruct model, it will be 8192.\n* If you can determine the maximum context length for your use case and it is less than your model’s maximum context length, it is better to set this parameter to that value.\n* This not only prevents memory shortage errors during model loading or usage but also helps configure other parameters such as max-num-batched-tokens and gpu-memory-utilization.\n* This value includes both input and output tokens, so if your use case involves no more than, say, 2046 tokens in total, set this parameter accordingly.\n\n## max-num-batched-tokens:\n\nTL;DR Start with max-model-len, then try doubling it: max-model-len * 2, then * 3… until you encounter a memory shortage error or eviction warning. Choose the value that provides optimal performance. If you use enable-chunked-prefill, start with a smaller value, such as 128, and experiment with different values. See explanations below.\n\n* Note that this parameter denotes the number of tokens, not the batch size.\n* Each request must go through a prefill stage (computing KV values for input tokens and distributing them into KV cache blocks), followed by a decoding stage, where recursive forward computation generates tokens one by one.\n* Thus, the minimum value of this parameter equals the maximum model length you set for prefill (unless you use chunked-prefill, discussed later).\n* The larger the batch size in terms of tokens, the more prefill operations occur, meaning the first token will be generated faster. The vLLM scheduler by default favors prefill.\n* Therefore, a larger batch size can enhance throughput, but this does not always apply, as throughput and latency also depend on the balance of computational resources and GPU memory (see chunked-prefill section), and prefill operations typically consume significant computational resources, quickly reaching the maximum token batch size we can retain.\n* Hence, it is better to start with max-model-len, try doubling it, and observe how performance changes until you encounter a memory shortage error or eviction warning, as specified here, to select the optimal batch token count. Then, apply this principle to understand how your system operates.\n\n## enable-prefix-caching:\n\nTL;DR Enable it unless you are using enable-chunked-prefill\n\n* It caches computed KV values for reuse, saving considerable time.\n* If most of your request remains unchanged upon input data, you will notice a more significant performance boost compared to configurations where only a small portion of tokens in the request remains unchanged.\n* If the enable-chunked-prefill parameter is used, this parameter is invalid (as of the time of this post’s publication).\n\n## gpu-memory-utilization:\n\nTL;DR The higher, the better, unless memory shortage occurs. 
Default is 0.9\n\n* The higher the GPU memory allocated for storing KV cache, the better the performance.\n* Reduce this value if you encounter memory shortage errors or if the GPU is performing another task requiring a reduction in this value.\n\n## enforce-eager:\n\nTL;DR Enable this if memory is insufficient\n\n* When enabled, no CUDA graph will be created, which may affect performance but will save memory that would otherwise be required for the graph.\n* Under certain configurations, the CUDA graph may not significantly impact performance, and the saved memory can be beneficial for increasing the values of the aforementioned parameters, such as gpu-memory-utilization, max-num-batched-tokens, etc.\n\n## enable-chunked-prefill:\n\nTL;DR Try enabling this feature and use batched tokens starting from 128 and gradually increasing. Compare with the configuration without this parameter. See details below.\n\n* The decoding stage is less demanding on computational resources compared to prefill (vector-matrix multiplication instead of matrix-matrix multiplication), but more demanding on memory, as it requires reading computed KV values.\n* For further reading, refer to this article and this vLLM page.\n* As previously mentioned, the vLLM scheduler by default favors prefill, which demands computational resources, while during decoding, memory becomes the bottleneck, preventing full utilization of decoding with low computational demands and high memory requirements, as KV values must be read.\n* Chunk prefill divides the prefill stage into chunks, allowing a reduction in the batch token count and enabling the formation of a batch by collecting decoding requests and adding chunked prefill requests, thereby efficiently utilizing GPU resources by combining computationally intensive chunked prefill with sufficient decoding requests that are low in computational demand but high in memory requirements.\n* Try enabling this feature again and use a small maximum batch token count, such as 128, gradually increasing it until you achieve the desired throughput and latency.\n* This method may reduce token latency, as decoding takes priority due to the small batch size (reducing memory pressure), but may affect throughput, so it is necessary to experiment with different batch sizes and compare results when this parameter is not used.\n\n## Example:\n\n* We will use vLLM’s API entrypoints instead of an OpenAI-compatible server.\n* To run the server without chunked prefill, use the following example:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 8192 — enable-prefix-caching\n\n* If chunked prefill is enabled, the command may look like this:\n\npython3 -m vllm.entrypoints.api_server — model MaziyarPanahi/Meta-Llama-3–8B-Instruct-GPTQ — gpu-memory-utilization 0.95 — port 5000 — dtype half — enforce-eager — max-model-len 4096 — max-num-batched-tokens 256 — enable-chunked-prefill\n\n* Send a request to the server’s endpoint mentioned above to evaluate performance under different parameters.\n* Conduct benchmarking using vLLM’s provided scripts, and I will also attempt to write a similar script and share it here.”}{“output”: “The vLLM Engine parameters we will discuss:\n\n- max-num-batched-tokens\n- max-model-len\n- gpu-memory-utilization\n- enable-prefix-caching\n- enable-chunked-prefill\n- enforce-eager\n\n## max-model-len:\n\nTL;DR Set according to 
## Example:

* We will use vLLM's plain API entrypoint rather than the OpenAI-compatible server.
* To run the server without chunked prefill, use a command like the following:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching

* If chunked prefill is enabled, the command may look like this:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

* Send requests to the server's endpoint to evaluate performance under the different parameter settings; a sample request follows below.
* For benchmarking, use vLLM's provided scripts (a rough timing sketch also follows below); I will also attempt to write a similar script and share it here.
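The demo api_server started above exposes a /generate route that accepts a JSON body with a prompt and sampling parameters. Here is a minimal sketch of a single request; the port, prompt, and parameters are placeholders matching the commands above:

```python
# Minimal sketch: one request against the demo api_server started above.
# Assumes the server is listening on localhost:5000; the prompt and sampling
# parameters are placeholders.
import requests

payload = {
    "prompt": "Explain the KV cache in two sentences.",
    "max_tokens": 128,
    "temperature": 0.0,
}
resp = requests.post("http://localhost:5000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # the demo server returns the generated text as JSON
```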
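Until then, a rough way to compare configurations is to fire a batch of identical requests concurrently and record the total wall-clock time and per-request latency. This is only a quick sanity check, not a replacement for vLLM's own benchmark scripts; the URL, prompt, and request count are placeholders:

```python
# Rough sketch: send N identical requests concurrently and report total time
# and mean per-request latency. Repeat under each server configuration you
# want to compare. URL, prompt, and request count are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:5000/generate"
N_REQUESTS = 32
payload = {"prompt": "Write a haiku about GPUs.", "max_tokens": 64, "temperature": 0.0}

def one_request(_):
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=300).raise_for_status()
    return time.perf_counter() - start

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    latencies = list(pool.map(one_request, range(N_REQUESTS)))
total = time.perf_counter() - t0

print(f"total time: {total:.2f}s")
print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")
```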