# vLLM v1 Backend Disaggregated Serving with MooncakeConnector

## Overview
This guide demonstrates how to use the MooncakeConnector with the vLLM v1 backend for disaggregated serving in a prefill/decode (PD) separation architecture. The integration enables efficient cross-node KV cache transfer over RDMA.

For more details about Mooncake, refer to the Mooncake project and its documentation.
## Installation

### Prerequisites

Install `mooncake-transfer-engine` via pip:
```bash
pip install mooncake-transfer-engine
```
Note: If you encounter problems such as missing `lib*.so` files, uninstall the package with `pip3 uninstall mooncake-transfer-engine` and build the binaries manually according to the Mooncake build instructions.
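To confirm the installation, you can try importing the transfer engine binding (a minimal check; the `mooncake.engine` module path is assumed from the Mooncake Python binding documentation):

```bash
# Should print OK if the package and its native libraries resolve correctly.
python3 -c "from mooncake.engine import TransferEngine; print('mooncake-transfer-engine OK')"
```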
### Install vLLM

Refer to the official vLLM installation guide for the latest installation instructions.
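On a typical CUDA machine this is a single pip install (check the guide for the wheel matching your platform and CUDA version):

```bash
pip install vllm
```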
## Usage

### Basic Setup (Different Nodes)

#### Prefiller Node (192.168.0.2)
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```
#### Decoder Node (192.168.0.3)
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8020 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```
#### Proxy Server
```bash
# Run from the vLLM repository root.
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
    --prefiller-host 192.168.0.2 --prefiller-port 8010 \
    --decoder-host 192.168.0.3 --decoder-port 8020
```
NOTE: The Mooncake Connector currently uses the proxy from nixl_integration. This will be replaced with a self-developed proxy in the future.

You can now send requests to the proxy server on port 8000.

### Test
```bash
curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Tell me a long story about artificial intelligence."}
        ]
    }'
```
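To pull just the generated text out of the JSON response (assuming `jq` is installed):

```bash
# The response follows the OpenAI chat completion schema.
curl -s http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}' \
    | jq -r '.choices[0].message.content'
```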
## Advanced Configuration

### With Tensor Parallelism

Prefiller:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --tensor-parallel-size 8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```
Decoder:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8020 \
    --tensor-parallel-size 8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```
### Configuration Parameters

- `--kv-transfer-config`: JSON string to configure the KV transfer connector
  - `kv_connector`: Set to `"MooncakeConnector"`
  - `kv_role`: Role of the instance
    - `kv_producer`: For prefiller instances that generate KV caches
    - `kv_consumer`: For decoder instances that consume KV caches
    - `kv_both`: Enables symmetric functionality (experimental)
  - `num_workers`: Thread pool size in each prefiller worker for sending KV cache (default: 10); see the sketch after this list
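One possible way to pass `num_workers` (a sketch, assuming it is supplied through `kv_connector_extra_config` as with other vLLM connector-specific options; verify against your vLLM version):

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8010 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_connector_extra_config":{"num_workers":16}}'
```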
### Environment Variables

The following environment variables can be used to customize Mooncake behavior (see the example after this list):

- `VLLM_MOONCAKE_BOOTSTRAP_PORT`: Port for the Mooncake bootstrap server
  - Default: 8998
  - Required only for prefiller instances
  - Each vLLM worker needs a unique port on its host
  - For TP/DP deployments, each worker's port is computed as `base_port + dp_rank * tp_size + tp_rank`
- `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the KV cache
  - Default: 480
  - Used when a request is aborted, to prevent holding resources indefinitely
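For example, the TP=8 prefiller above (with data parallelism disabled, so `dp_rank = 0`) binds one bootstrap port per worker: worker `tp_rank = 3` gets `8998 + 0 * 8 + 3 = 9001`, so ports 8998-9005 must all be free and reachable.

```bash
# Base port for the bootstrap server; workers derive their own ports from it.
export VLLM_MOONCAKE_BOOTSTRAP_PORT=8998
# Release KV cache held by aborted requests after 5 minutes instead of 8.
export VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT=300
```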
## Performance

For detailed performance benchmarks and results, see the vLLM benchmark documentation.
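To measure end-to-end serving performance against the proxy yourself, one option is vLLM's bundled serving benchmark (a sketch; the script location and flags may differ across vLLM versions):

```bash
# Run from the vLLM repository root, with the proxy serving on port 8000.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 127.0.0.1 --port 8000 \
    --dataset-name random \
    --random-input-len 1024 --random-output-len 128 \
    --num-prompts 100
```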
## Notes

- Tensor parallelism (TP) is supported for both prefiller and decoder instances
- The proxy server should typically run on the decoder node
- Ensure network connectivity between prefiller and decoder nodes for RDMA transfer
- For production deployments, consider using a more robust proxy solution
## Troubleshooting

If you encounter connection issues, check that:

- All nodes can reach each other over the network
- Firewall rules allow traffic on the specified ports
- RDMA devices are properly configured
- For missing library errors, rebuild mooncake-transfer-engine from source
- Enable debug logging with `VLLM_LOGGING_LEVEL=DEBUG` for detailed diagnostics
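A few quick checks to run from each node (assuming the `rdma-core` userspace tools are installed):

```bash
# Verify basic reachability between the prefiller and decoder nodes.
ping -c 3 192.168.0.3

# List RDMA devices and confirm the ports are active.
ibv_devices
ibv_devinfo | grep -E "state|link_layer"
```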