# vLLM with Mooncake Transfer Engine Benchmark

Mooncake now implements a vLLM connector, enabling direct support for the Prefill-Decode (PD) disaggregation architecture in vLLM v1. We evaluated the performance of this integration, focusing on the efficiency of cross-node KV cache transfer over RDMA.

## Benchmark Results

### Bandwidth Performance

We measured the actual transfer bandwidth during the execution of requests with varying prompt lengths.

*Figure: KV Transfer Bandwidth (Actual)*

In a 1P1D (1 Prefiller, 1 Decoder) configuration using the Qwen3-8B model, Mooncake achieved a peak measured transfer bandwidth of 142.25 GB/s. Against the theoretical maximum of approximately 200 GB/s for the 8× RoCE links, this corresponds to a 71.1% bandwidth utilization rate. This efficiency demonstrates that the custom transfer protocol and GPUDirect RDMA can come close to saturating high-performance networks.
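As a sanity check on the utilization figure: the text only states an aggregate of roughly 200 GB/s, so a plausible decomposition (our assumption, not stated in the setup) is 8 ConnectX-7 ports at 200 Gb/s each. A minimal sketch of the arithmetic:

```python
# Sanity check on the reported bandwidth utilization.
# Assumption: 8 RoCE NICs at 200 Gb/s each; only the ~200 GB/s aggregate
# is stated in the text, the per-port rate is inferred.
num_nics = 8
per_nic_gbps = 200                         # Gb/s per ConnectX-7 port (assumed)
aggregate_gbps = num_nics * per_nic_gbps   # 1600 Gb/s
aggregate_gbs = aggregate_gbps / 8         # bits -> bytes: 200 GB/s

peak_measured_gbs = 142.25                 # peak bandwidth from the table
utilization = peak_measured_gbs / aggregate_gbs
print(f"aggregate: {aggregate_gbs:.0f} GB/s, utilization: {utilization:.1%}")
# -> aggregate: 200 GB/s, utilization: 71.1%
```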

### End-to-End Latency (TTFT)

We analyzed Time To First Token (TTFT) to understand how KV transfer overhead affects end-to-end latency.

*Figure: TTFT Breakdown*

*Figure: Transfer Time vs KV Size*

The results show that Mooncake’s high-speed transfer keeps the overhead of moving the KV cache small relative to computation time. For a prompt length of 32,768 tokens (transferring 4.50 GB of data), the actual KV transfer took only 31.65 ms, i.e. 4.2% of the 749.62 ms total TTFT.

Detailed performance data:

| Prompt Length | Mean TTFT (ms) | KV Size | Actual Transfer Time (ms) | Actual Bandwidth (GB/s) | Bandwidth Utilization |
|---|---|---|---|---|---|
| 128 tokens | 46.09 | 20 MB | 0.54 | 36.53 | 18.3% |
| 256 tokens | 48.04 | 38 MB | 0.61 | 60.78 | 30.4% |
| 512 tokens | 59.91 | 74 MB | 0.92 | 78.73 | 39.4% |
| 1024 tokens | 67.29 | 146 MB | 1.50 | 95.23 | 47.6% |
| 2048 tokens | 85.31 | 290 MB | 2.51 | 112.88 | 56.4% |
| 4096 tokens | 124.42 | 578 MB | 4.75 | 119.00 | 59.5% |
| 8192 tokens | 212.05 | 1.13 GB | 8.84 | 127.57 | 63.8% |
| 16384 tokens | 387.52 | 2.25 GB | 16.43 | 137.09 | 68.5% |
| 32768 tokens | 749.62 | 4.50 GB | 31.65 | 142.25 | 71.1% |
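Two of the columns above can be cross-checked from first principles. The sketch below estimates the per-token KV footprint of Qwen3-8B using shape values from its public configuration (36 layers, 8 KV heads under GQA, head dimension 128, bf16 KV cache); these values are our assumption, not stated in the benchmark setup. It reproduces the 4.50 GB KV size and the 4.2% overhead figure for the 32,768-token row:

```python
# Rough KV-cache size estimate for Qwen3-8B with a bf16 KV cache.
# Model shape values are taken from the public Qwen3-8B config and are
# assumptions of this sketch: 36 layers, 8 KV heads (GQA), head_dim 128.
layers, kv_heads, head_dim, dtype_bytes = 36, 8, 128, 2

# Factor of 2 accounts for both K and V tensors.
bytes_per_token = 2 * kv_heads * head_dim * dtype_bytes * layers
print(bytes_per_token)             # 147456 B ~= 144 KiB per token

kv_bytes = 32768 * bytes_per_token
print(kv_bytes / 2**30)            # ~4.50 GiB, matching the table

# Transfer overhead for the 32K-token row:
print(31.65 / 749.62)              # ~0.042 -> 4.2% of total TTFT
```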

## Benchmark Setup

### H800 Cluster

**Experimental Environment**

- **Hardware**: 16× NVIDIA H800 (81 GB), 8 per node; 8× Mellanox ConnectX-7 (RoCE over Ethernet)
- **Topology**: Prefiller and Decoder connected via RoCE
- **Model**: Qwen3-8B
- **vLLM Version**: 0.11.2.dev358
- **KV Connector**: MooncakeConnector
- **Transfer Method**: Cross-node RDMA

### Launch Commands

Prefiller:

```bash
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  python -m vllm.entrypoints.openai.api_server \
  --model /work/models/Qwen3-8B \
  --host 0.0.0.0 --port 8010 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

Decoder:

```bash
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  python -m vllm.entrypoints.openai.api_server \
  --model /work/models/Qwen3-8B \
  --host 0.0.0.0 --port 8020 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

Proxy (on the Decoder node):

```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
  --host 0.0.0.0 --port 8000 \
  --prefiller-host 10.0.28.193 --prefiller-port 8010 \
  --decoder-host 10.0.28.202 --decoder-port 8020
```
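Before benchmarking, a single request through the proxy is a useful end-to-end check of the PD path. The sketch below assumes the toy proxy forwards the standard OpenAI chat completions API (consistent with the /v1/chat/completions endpoint used by the benchmark) and that the model is served under its --model path, vLLM's default:

```python
# Minimal smoke test through the proxy before benchmarking.
# Assumes the proxy forwards the standard OpenAI chat completions API.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        # vLLM serves the model under its --model path by default.
        "model": "/work/models/Qwen3-8B",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```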

### Benchmark Script

We used `vllm bench serve` to generate traffic with varying prompt lengths.

```bash
for prompt_len in 128 256 512 1024 2048 4096 8192 16384 32768; do
  vllm bench serve \
    --model /work/models/Qwen3-8B \
    --num-prompts 50 \
    --random-input-len ${prompt_len} \
    --random-output-len 128 \
    --base-url http://127.0.0.1:8000 \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --max-concurrency 1 \
    --dataset-name random
done
```
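To reproduce the TTFT column of the table above, each run can be saved and post-processed. This is a sketch: it assumes the loop is extended with vLLM's --save-result and --result-filename options, and that the saved result JSON exposes a mean_ttft_ms field, as vLLM's serving benchmark results do.

```python
# Collect mean TTFT per prompt length from saved benchmark results.
# Assumptions: each run above was launched with
#   --save-result --result-filename results_${prompt_len}.json
# and the JSON contains a "mean_ttft_ms" field.
import json

for prompt_len in [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]:
    with open(f"results_{prompt_len}.json") as f:
        result = json.load(f)
    print(f"{prompt_len:>6} tokens: mean TTFT {result['mean_ttft_ms']:.2f} ms")
```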

By the Mooncake Team
