Guide: vLLM MooncakeStoreConnector

Guide: vLLM MooncakeStoreConnector#

Overview#

This document describes how to deploy vLLM’s MooncakeStoreConnector. MooncakeStoreConnector is a new vLLM’s KV connector that uses MooncakeDistributedStore as a shared KV cache pool. It enables:

  • CPU/Disk offloading: Extend effective KV cache capacity by offloading to CPU memory or SSD via Mooncake’s transfer engine.

  • Hash-based prefix caching across instances: Multiple vLLM instances share cached KV blocks through the store using block-hash deduplication.

  • Flexible deployment: Works as a single-node KV cache extension (kv_both), or in disaggregated prefill-decode setups (kv_producer / kv_consumer).

Deployment#

1. Prerequisites#

Before you begin, make sure that:

2. Mooncake Master Server#

Start:

mooncake_master --port 50063

Configure Mooncake : Create a JSON configuration file (e.g., mooncake_config.json):

{
  "metadata_server": "http://127.0.0.1:8092/metadata",
  "master_server_address": "127.0.0.1:50063",
  "global_segment_size": "0",
  "local_buffer_size": "2147483648",
  "protocol": "rdma",
  "device_name": "",
}

Set environment variable:

export MOONCAKE_CONFIG_PATH=/path/to/mooncake_config.json

3. Usage#

3.1 Single-Node KV Cache Offloading (i.e., kv_both)

MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'

3.2 XpYd Disaggregated Prefill-Decode (i.e., kv_producer/kv_consumer)

Prefill Node:

MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50052 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8100 \
    --kv-transfer-config '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
            "connectors": [
                {
                    "kv_connector": "MooncakeConnector",
                    "kv_role": "kv_producer"
                },
                {
                    "kv_connector": "MooncakeStoreConnector",
                    "kv_role": "kv_producer"
                }
            ]
        }
    }'

Decode Node:

MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50053 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8200 \
    --kv-transfer-config '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_consumer",
        "kv_connector_extra_config": {
            "connectors": [
                {
                    "kv_connector": "MooncakeConnector",
                    "kv_role": "kv_consumer"
                },
                {
                    "kv_connector": "MooncakeStoreConnector",
                    "kv_role": "kv_consumer"
                }
            ]
        }
    }'

Proxy:

python examples/disaggregated/disaggregated_serving/mooncake_connector/mooncake_connector_proxy.py --prefill http://192.168.0.2:8100 --decode http://192.168.0.3:8200

When running with data parallelism, set a fixed PYTHONHASHSEED so that block hashes are consistent across DP ranks:

PYTHONHASHSEED=0 vllm serve ...

Without this, identical prompts may produce different block hashes on different DP ranks, preventing cross-instance prefix cache hits.