vLLM v1 backend Disaggregated Serving with MooncakeConnector

Overview

This guide shows how to use MooncakeConnector with the vLLM v1 backend for disaggregated serving, where prefill and decode run on separate instances (Prefill-Decode separation). The integration enables efficient cross-node KV cache transfer over RDMA.

For more details about Mooncake, see the Mooncake project and the Mooncake documentation.

Installation

Prerequisites

Install mooncake-transfer-engine through pip:

pip install mooncake-transfer-engine

Note: If you encounter errors such as a missing lib*.so, uninstall the package with pip3 uninstall mooncake-transfer-engine and build the binaries manually according to the Mooncake build instructions.
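To confirm that the package is visible to the Python environment that will run vLLM, a quick check along these lines may help. This is a minimal sketch that only queries package metadata through the standard library; the helper name installed_version is hypothetical, and a successful result does not prove that the shared libraries load correctly.

```python
from importlib import metadata


def installed_version(dist_name: str = "mooncake-transfer-engine"):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("mooncake-transfer-engine is not installed in this environment")
    else:
        print(f"mooncake-transfer-engine {version} is installed")
```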

Install vLLM

Refer to vLLM official installation guide for the latest installation instructions.

Usage

Basic Setup (Different Nodes)

Prefiller Node (192.168.0.2)

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8010 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

Decoder Node (192.168.0.3)

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8020 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'

Proxy Server

# Run from the vLLM root directory.
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
  --prefiller-host 192.168.0.2 --prefiller-port 8010 \
  --decoder-host 192.168.0.3 --decoder-port 8020

NOTE: MooncakeConnector currently reuses the toy proxy from nixl_integration. It will be replaced with a dedicated proxy in the future.

Now you can send requests to the proxy server through port 8000.

Test

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Tell me a long story about artificial intelligence."}
    ]
  }'
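The same request can be sent from Python. The following is a minimal sketch using only the standard library; the function names build_chat_request and send_chat_request are hypothetical, and the response parsing assumes the proxy returns the usual OpenAI-compatible chat completion body.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "Qwen/Qwen2.5-7B-Instruct",
                       base_url: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build the POST request that mirrors the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(prompt: str, timeout: float = 120.0) -> str:
    """Send the request to the proxy and return the generated text.

    Assumes an OpenAI-compatible response: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Once the prefiller, decoder, and proxy are all running, calling send_chat_request("Tell me a long story about artificial intelligence.") should return the model's reply as a string.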

Advanced Configuration

With Tensor Parallelism

Prefiller:

# Qwen2.5-7B-Instruct has 28 attention heads, so the TP size must divide 28 (4 works, 8 does not).
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8010 \
  --tensor-parallel-size 4 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

Decoder:

# Same constraint as the prefiller: the TP size must divide the 28 attention heads.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8020 \
  --tensor-parallel-size 4 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'

Configuration Parameters

  • --kv-transfer-config: JSON string to configure the KV transfer connector

    • kv_connector: Set to "MooncakeConnector"

    • kv_role: Role of the instance

      • kv_producer: For prefiller instances that generate KV caches

      • kv_consumer: For decoder instances that consume KV caches

      • kv_both: Enables symmetric functionality (experimental)

    • num_workers: Thread pool size in each prefiller worker for sending KV caches (default: 10); this is a connector option, not a kv_role value
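When launching several instances from a script, it can be less error-prone to generate the JSON string than to hand-quote it. The sketch below builds the --kv-transfer-config value and the full vllm serve command line using only the keys shown in this guide; the helper names kv_transfer_config and serve_command are hypothetical.

```python
import json
import shlex

# Roles listed in this guide.
VALID_ROLES = {"kv_producer", "kv_consumer", "kv_both"}


def kv_transfer_config(role: str) -> str:
    """Return the compact JSON string passed to --kv-transfer-config."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown kv_role {role!r}; expected one of {sorted(VALID_ROLES)}")
    config = {"kv_connector": "MooncakeConnector", "kv_role": role}
    return json.dumps(config, separators=(",", ":"))


def serve_command(model: str, port: int, role: str) -> str:
    """Return a shell-quoted `vllm serve` command line for one instance."""
    args = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", kv_transfer_config(role),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(serve_command("Qwen/Qwen2.5-7B-Instruct", 8010, "kv_producer"))
    print(serve_command("Qwen/Qwen2.5-7B-Instruct", 8020, "kv_consumer"))
```

shlex.join wraps the JSON in single quotes, so the printed commands can be pasted into a shell as they are.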

Environment Variables

The following environment variables can be used to customize Mooncake behavior:

  • VLLM_MOONCAKE_BOOTSTRAP_PORT: Port for Mooncake bootstrap server

    • Default: 8998

    • Required only for prefiller instances

    • Each vLLM worker needs a unique port on its host

    • For TP/DP deployments, each worker’s port is computed as: base_port + dp_rank * tp_size + tp_rank

  • VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: Timeout (in seconds) for automatically releasing KV cache

    • Default: 480

    • Used when a request is aborted to prevent holding resources indefinitely
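The per-worker port formula above can be worked through in a few lines, which is handy when deciding which ports to open in a firewall. This is an illustrative sketch of the arithmetic only, not code taken from vLLM; the function names are hypothetical, and the environment variable is read here merely to pick up the base port.

```python
import os


def worker_bootstrap_port(base_port: int, dp_rank: int, tp_size: int, tp_rank: int) -> int:
    """Apply the documented formula: base_port + dp_rank * tp_size + tp_rank."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must be in [0, tp_size)")
    if dp_rank < 0:
        raise ValueError("dp_rank must be non-negative")
    return base_port + dp_rank * tp_size + tp_rank


def all_worker_ports(base_port: int, dp_size: int, tp_size: int) -> list:
    """List the bootstrap ports used by every worker of one prefiller deployment."""
    return [
        worker_bootstrap_port(base_port, dp_rank, tp_size, tp_rank)
        for dp_rank in range(dp_size)
        for tp_rank in range(tp_size)
    ]


if __name__ == "__main__":
    # Fall back to the documented default when the variable is unset.
    base = int(os.environ.get("VLLM_MOONCAKE_BOOTSTRAP_PORT", "8998"))
    print(all_worker_ports(base, dp_size=2, tp_size=4))
```

With the default base port 8998, dp_size=2 and tp_size=4, the workers occupy the eight consecutive ports 8998 through 9005; for example, the worker with dp_rank=1 and tp_rank=2 uses 8998 + 1 * 4 + 2 = 9004.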

Performance

For detailed performance benchmarks and results, see the vLLM Benchmark documentation.

Notes

  • Tensor parallelism (TP) is supported for both prefiller and decoder instances

  • The proxy server should typically run on the decoder node

  • Ensure network connectivity between prefiller and decoder nodes for RDMA transfer

  • For production deployments, consider using a more robust proxy solution

Troubleshooting

  • If you encounter connection issues, check that:

    • All nodes can reach each other over the network

    • Firewall rules allow traffic on the specified ports

    • RDMA devices are properly configured

  • For missing library errors, rebuild mooncake-transfer-engine from source

  • Enable debug logging with VLLM_LOGGING_LEVEL=DEBUG for detailed diagnostics
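A first pass at the connectivity checks above can be scripted. The sketch below tests plain TCP reachability of the ports used in this guide; the helper names are hypothetical, and treating the bootstrap port as a TCP endpoint on the prefiller is an assumption. It does not exercise the RDMA data path, so a passing result only rules out basic routing and firewall problems.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(endpoints, timeout: float = 3.0) -> dict:
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(host, port): port_reachable(host, port, timeout) for host, port in endpoints}


if __name__ == "__main__":
    # Addresses and ports from the basic setup in this guide.
    endpoints = [
        ("192.168.0.2", 8010),  # prefiller HTTP port
        ("192.168.0.3", 8020),  # decoder HTTP port
        ("192.168.0.2", 8998),  # default bootstrap port (assumed reachable over TCP)
    ]
    for (host, port), ok in check_endpoints(endpoints).items():
        print(f"{host}:{port} {'reachable' if ok else 'NOT reachable'}")
```

Run it from the node that needs to initiate the connection, for example from the decoder node when checking the prefiller's ports.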