KV Cache Storage & Sharing with MooncakeStore#

Overview#

This guide shows how to use MooncakeStore / MooncakeStoreConnector with vLLM to build a distributed KV cache storage pool. The pool enables KV cache offloading to CPU/SSD, hash-based prefix caching across multiple vLLM instances, and flexible XpYd disaggregated deployment (X prefill instances and Y decode instances), where you can adjust the prefill and decode group sizes at runtime.

Compared to Redis-based backends, MooncakeStore achieves significantly lower TTFT (e.g., ~32% improvement in mean TTFT for 2P2D tp=2 under RDMA). See benchmark results for details.
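To make the idea of hash-based prefix caching concrete, here is a minimal sketch of how chained block hashes let independent instances arrive at the same cache keys for a shared prompt prefix. It is an illustration only and not MooncakeStore's actual key scheme: the block size, the SHA-256 chaining, and the names block_keys and shared_prefix_blocks are all assumptions.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # hypothetical block size, kept tiny for readability


def block_keys(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one key per full block; each key depends on every preceding token."""
    keys = []
    parent = b""  # digest of the previous block, empty for the first block
    full = len(token_ids) - len(token_ids) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256()
        h.update(parent)
        h.update(",".join(map(str, block)).encode("utf-8"))
        parent = h.digest()
        keys.append(parent.hex())
    return keys


def shared_prefix_blocks(a: List[int], b: List[int]) -> int:
    """Count the leading blocks whose keys match, i.e. blocks reusable from a shared store."""
    count = 0
    for key_a, key_b in zip(block_keys(a), block_keys(b)):
        if key_a != key_b:
            break
        count += 1
    return count
```

Because the keys depend only on token content, two instances that never talk to each other still compute identical keys for an identical prefix, which is what makes a shared store usable as a lookup table.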


Choose Your vLLM Backend#

| Backend | Connector | vLLM Version | Status | Guide |
| --- | --- | --- | --- | --- |
| vLLM V1 | MooncakeStoreConnector | Latest | Recommended | Jump to V1 guide |
| vLLM V0 | MooncakeStore | ≤ v0.6.4.post1 | Legacy | Jump to V0 guide |

New Users

If you are starting a new deployment, use the vLLM V1 backend with MooncakeStoreConnector. V0 support is maintained for existing deployments only.

Key differences between V0 and V1:

  • XpYd support and orchestration: Dynamically change the membership of the prefill and decode groups

  • More stable and fault-tolerant: A sudden crash of a single vLLM instance is tolerable; instance-to-instance connections are removed, so each instance works as a vanilla vLLM instance capable of handling requests independently



Using vLLM V0 (Legacy)#

Legacy Backend

This section is for vLLM V0 backend with MooncakeStore. For new deployments, use the V1 backend with MooncakeStoreConnector above.

This integration is based on PR 10502 and PR 12957 to support KVCache transfer for intra-node and inter-node disaggregated serving.

Installation#

Prerequisite#

pip3 install mooncake-transfer-engine

Note

  • If you encounter problems such as a missing lib*.so, uninstall the package with pip3 uninstall mooncake-transfer-engine and build it manually according to the instructions.

  • vLLM versions ≤ v0.8.4 require mooncake-transfer-engine 0.3.3.post2. The mooncake_vllm_adaptor interface has been deprecated in the latest release.
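A quick way to confirm which mooncake-transfer-engine version is installed before launching vLLM is to query the package metadata. The sketch below uses only the standard library; the helper names installed_version and matches_pin are hypothetical, and the pinned version string is taken from the note above.

```python
from importlib import metadata
from typing import Optional

PACKAGE = "mooncake-transfer-engine"
PINNED_FOR_OLD_VLLM = "0.3.3.post2"  # version named in the note for vLLM <= v0.8.4


def installed_version(package: str = PACKAGE) -> Optional[str]:
    """Return the installed version string, or None if the package is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(version: Optional[str], pin: str = PINNED_FOR_OLD_VLLM) -> bool:
    """True only when the installed version is exactly the pinned one."""
    return version == pin


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print(f"{PACKAGE} is not installed")
    else:
        print(f"{PACKAGE} {found} (matches pin: {matches_pin(found)})")
```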

Install vLLM#

1. Clone vLLM:

git clone git@github.com:vllm-project/vllm.git

2. Build from source:

cd vllm
pip3 install -e .

If you encounter problems, refer to the vLLM official compilation guide.

Configuration#

Prepare configuration for RDMA#

Create a mooncake.json file:

{
    "local_hostname": "192.168.0.137",
    "metadata_server": "etcd://192.168.0.137:2379",
    "protocol": "rdma",
    "device_name": "erdma_0",
    "master_server_address": "192.168.0.137:50001"
}
  • local_hostname: The IP address of the current node. All prefill and decode instances on the same node can share this config.

  • metadata_server: The metadata server. Supports etcd, redis, and http backends.

  • protocol: "rdma" or "tcp".

  • device_name: Required for RDMA. Separate multiple NICs with commas (e.g., "erdma_0,erdma_1").

  • master_server_address: The IP address and port of the MooncakeStore master daemon.
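If you generate mooncake.json for several nodes, a small helper can apply the field rules listed above before anything is written. This is a sketch under assumptions: build_mooncake_config is a hypothetical helper and not part of Mooncake, and it only checks the two constraints stated in this guide (the protocol value, and device_name being required for RDMA).

```python
import json


def build_mooncake_config(local_ip, metadata_server, master_server_address,
                          protocol="rdma", device_name=""):
    """Assemble the mooncake.json fields and check the constraints described in this guide."""
    if protocol not in ("rdma", "tcp"):
        raise ValueError('protocol must be "rdma" or "tcp"')
    if protocol == "rdma" and not device_name:
        raise ValueError("device_name is required when protocol is rdma")
    return {
        "local_hostname": local_ip,
        "metadata_server": metadata_server,
        "protocol": protocol,
        "device_name": device_name,
        "master_server_address": master_server_address,
    }


if __name__ == "__main__":
    config = build_mooncake_config(
        local_ip="192.168.0.137",
        metadata_server="etcd://192.168.0.137:2379",
        master_server_address="192.168.0.137:50001",
        protocol="rdma",
        device_name="erdma_0",
    )
    # Print the JSON; redirect it to mooncake.json once the values look right.
    print(json.dumps(config, indent=4))
```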

Prepare configuration for TCP#

{
    "local_hostname": "192.168.0.137",
    "metadata_server": "etcd://192.168.0.137:2379",
    "protocol": "tcp",
    "device_name": "",
    "master_server_address": "192.168.0.137:50001"
}

Run Example#

Change the IP addresses and ports according to your environment. VLLM_USE_V1=0 is required for vLLM V0 backend.

# Begin from root of your cloned repo!

# 1. Start the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379
# You may need to terminate other etcd processes before running the above command

# 2. Start the mooncake_master server
mooncake_master --port 50001
# If some vllm instances exit unexpectedly, some connection metadata will be
# corrupted since they are not properly cleaned. In that case, we recommend
# you restart the mooncake_master before running another test.

# 3. Run multiple vllm instances
# kv_producer role
CUDA_VISIBLE_DEVICES=0 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8100 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}'

CUDA_VISIBLE_DEVICES=1 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8101 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}'

CUDA_VISIBLE_DEVICES=2 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8102 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}'

CUDA_VISIBLE_DEVICES=3 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8103 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}'

# kv_consumer role
CUDA_VISIBLE_DEVICES=4 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8200 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}'

CUDA_VISIBLE_DEVICES=5 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8201 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}'

CUDA_VISIBLE_DEVICES=6 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8202 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}'

CUDA_VISIBLE_DEVICES=7 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_V1=0 python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8203 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}'

Key parameters:

  • MOONCAKE_CONFIG_PATH: Path to the mooncake.json configuration file.

  • VLLM_USE_MODELSCOPE: Optional and not set in the commands above. Set VLLM_USE_MODELSCOPE=True to download the model from ModelScope if you cannot access HuggingFace.

  • VLLM_USE_V1=0: Required because this legacy integration runs on the vLLM V0 engine. You can also export it once in your environment instead of putting it in front of every command.

  • --model: The model to use.

  • --port: The vllm service port on which to listen.

  • --max-model-len: The maximum context length in tokens (prompt plus generated output).

  • --tensor-parallel-size / -tp: Supported. All instances should have the same tensor_parallel_size. If running prefill and decode on the same node, set different CUDA_VISIBLE_DEVICES (e.g., CUDA_VISIBLE_DEVICES=0,1 for prefill and CUDA_VISIBLE_DEVICES=2,3 for decode).

  • --kv-transfer-config: Set kv_connector to "MooncakeStoreConnector", kv_role to "kv_producer", "kv_consumer", or "kv_both".

  • If some vLLM instances exit unexpectedly, connection metadata may be corrupted. Restart mooncake_master before another test.
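The eight launch commands differ only in GPU index, port, and kv_role, so they can be generated instead of copied. The following sketch reproduces the same layout (producers on ports 8100 and up, consumers on 8200 and up, one GPU each). The functions launch_command and plan are hypothetical helpers; they only build command strings and do not start anything.

```python
import json
import shlex

MODEL = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"
ROLES = ("kv_producer", "kv_consumer", "kv_both")


def launch_command(role, gpu, port, config_path="./mooncake.json"):
    """Build one shell command line for a vLLM V0 instance with the given kv_role."""
    if role not in ROLES:
        raise ValueError(f"kv_role must be one of {ROLES}")
    kv_config = json.dumps(
        {"kv_connector": "MooncakeStoreConnector", "kv_role": role},
        separators=(",", ":"),
    )
    env = f"CUDA_VISIBLE_DEVICES={gpu} MOONCAKE_CONFIG_PATH={config_path} VLLM_USE_V1=0"
    args = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", MODEL,
        "--port", str(port),
        "--max-model-len", "10000",
        "--gpu-memory-utilization", "0.8",
        "--kv-transfer-config", kv_config,
    ]
    # shlex.quote wraps the JSON argument in single quotes for the shell.
    return env + " " + " ".join(shlex.quote(arg) for arg in args)


def plan(num_prefill=4, num_decode=4):
    """Return commands for num_prefill producers followed by num_decode consumers."""
    commands = []
    for i in range(num_prefill):
        commands.append(launch_command("kv_producer", gpu=i, port=8100 + i))
    for j in range(num_decode):
        commands.append(launch_command("kv_consumer", gpu=num_prefill + j, port=8200 + j))
    return commands


if __name__ == "__main__":
    for command in plan():
        print(command)
```

Printing the commands first and launching them by hand keeps the sketch safe to run on a machine without GPUs.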

# 4. Start the proxy server
cd vllm
python3 examples/online_serving/disagg_examples/disagg_proxy_demo.py \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --prefill localhost:8100 localhost:8101 \
    --decode localhost:8200 localhost:8201 \
    --port 8000
  • --model: The model and tokenizer used by the proxy server.

  • --port: The proxy server port on which to listen.

  • --prefill / -p: IP and port of the vllm prefill instances.

  • --decode / -d: IP and port of the vllm decode instances.

Dynamic XpYd Adjustment#

To dynamically adjust prefill and decode instances at runtime:

export ADMIN_API_KEY="xxxxxxxx"

# or add it before the command:
ADMIN_API_KEY="xxxxxxxx" python3 vllm/examples/online_serving/disagg_examples/disagg_proxy_demo.py \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --prefill localhost:8100 localhost:8101 \
    --decode localhost:8200 localhost:8201 \
    --port 8000 \
    --scheduling round_robin

# Add instances to groups dynamically
curl -X POST "http://localhost:8000/instances/add" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $ADMIN_API_KEY" \
  -d '{"type": "prefill", "instance": "localhost:8102"}'

curl -X POST "http://localhost:8000/instances/add" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $ADMIN_API_KEY" \
  -d '{"type": "prefill", "instance": "localhost:8103"}'

curl -X POST "http://localhost:8000/instances/add" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $ADMIN_API_KEY" \
  -d '{"type": "decode", "instance": "localhost:8202"}'

curl -X POST "http://localhost:8000/instances/add" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $ADMIN_API_KEY" \
  -d '{"type": "decode", "instance": "localhost:8203"}'

# Get proxy status
curl localhost:8000/status | jq
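The same /instances/add call can be issued from Python, for example from an autoscaling script. The sketch below mirrors the curl commands above (same path, JSON body, and X-API-Key header) using only the standard library; add_instance_request is a hypothetical helper that builds the request without sending it.

```python
import json
import urllib.request


def add_instance_request(proxy, instance_type, instance, api_key):
    """Build the POST request that registers an instance with the demo proxy."""
    if instance_type not in ("prefill", "decode"):
        raise ValueError('type must be "prefill" or "decode"')
    body = json.dumps({"type": instance_type, "instance": instance}).encode("utf-8")
    return urllib.request.Request(
        f"http://{proxy}/instances/add",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen once the proxy is running, reading the key from the ADMIN_API_KEY environment variable rather than hard-coding it.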

Note

The Mooncake team provides this simple round-robin proxy as a demo. In production, you can implement your own global proxy strategy.

Be sure to change the IP address in the commands.
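As a starting point for a custom strategy, the sketch below shows the round-robin idea in isolation: each request is assigned the next prefill instance and the next decode instance, and instances added at runtime join the rotation. It is not the demo proxy's code; RoundRobinRouter and its methods are hypothetical names.

```python
class RoundRobinRouter:
    """Pick (prefill, decode) pairs in rotation; groups can grow at runtime."""

    def __init__(self, prefill, decode):
        self.prefill = list(prefill)
        self.decode = list(decode)
        self._prefill_turn = 0
        self._decode_turn = 0

    def add(self, instance_type, instance):
        """Add an instance to the named group if it is not already present."""
        group = {"prefill": self.prefill, "decode": self.decode}[instance_type]
        if instance not in group:
            group.append(instance)

    def next_pair(self):
        """Return the next (prefill, decode) pair and advance both counters."""
        if not self.prefill or not self.decode:
            raise RuntimeError("need at least one prefill and one decode instance")
        prefill = self.prefill[self._prefill_turn % len(self.prefill)]
        decode = self.decode[self._decode_turn % len(self.decode)]
        self._prefill_turn += 1
        self._decode_turn += 1
        return prefill, decode
```

Index counters are used instead of itertools.cycle so that an instance added after startup is picked up without rebuilding the iterator. A production strategy would more likely weigh load or cache locality, which this sketch ignores.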

Test#

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "prompt": "San Francisco is a",
    "max_tokens": 1000
  }'
  • If you are not running the test on the proxy server's host, change localhost to the IP address of the proxy server.
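The same test can be scripted in Python. The sketch below builds the request shown in the curl command and extracts the generated text from a response; it assumes the proxy returns the OpenAI-style completions format (a choices list whose entries carry a text field). The helper names completion_request and first_completion_text are hypothetical.

```python
import json
import urllib.request


def completion_request(proxy, model, prompt, max_tokens=1000):
    """Build a POST request for the /v1/completions endpoint behind the proxy."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"http://{proxy}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body):
    """Return the text of the first choice from an OpenAI-style completions response."""
    payload = json.loads(response_body)
    return payload["choices"][0]["text"]
```

With the proxy running, pass the request to urllib.request.urlopen and feed the bytes from the response's read() into first_completion_text.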


Performance#

| Scenario | Document |
| --- | --- |
| V1 MooncakeStoreConnector vs Redis | Benchmark V1 |
| V0 MooncakeStore vs Redis | Benchmark V0 |


Troubleshooting#

  • If you encounter connection issues, check that:

    • All nodes can reach each other over the network

    • Firewall rules allow traffic on the specified ports

    • RDMA devices are properly configured

    • mooncake_master is running and reachable

  • For missing library errors, rebuild mooncake-transfer-engine from source

  • If vLLM instances exit unexpectedly, restart mooncake_master to clean up corrupted metadata

  • Enable debug logging with VLLM_LOGGING_LEVEL=DEBUG for detailed diagnostics
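For the connectivity checks above, a short script can confirm that the relevant TCP ports accept connections from a given node. This is a minimal sketch that only tests TCP reachability; it says nothing about RDMA device health. The helper names and the endpoint labels are hypothetical, while the example addresses are the ones used earlier in this guide.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable hosts
        return False


def check_endpoints(endpoints, timeout=2.0):
    """Map each endpoint name to whether its (host, port) is reachable."""
    return {
        name: can_connect(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }


if __name__ == "__main__":
    results = check_endpoints({
        "etcd": ("192.168.0.137", 2379),
        "mooncake_master": ("192.168.0.137", 50001),
    })
    for name, reachable in results.items():
        print(f"{name}: {'reachable' if reachable else 'NOT reachable'}")
```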