vLLM V1 Disaggregated Serving with Mooncake Store and LMCache
Overview
vLLM V1 has been released with support for prefill-decode (PD) disaggregation. The detailed design document can be found here. LMCache provides the corresponding connector, which stores, transfers, and loads KVCache so that the prefill and decode nodes can work together. Mooncake, used as LMCache's backend storage engine, has been extensively optimized for usability, performance, and stability. This document explains how to deploy a PD disaggregated serving demo using LMCache + Mooncake.
Deployment
First, prepare two GPU-equipped machines, referred to below as Machine A and Machine B. Install vLLM, Mooncake, and LMCache on both machines. For installation instructions, refer to the official documentation of each repository.
Start the Mooncake Master node on Machine A:
mooncake_master -port 50052 -max_threads 64 -metrics_port 9004
Start the HTTP metadata server (default port 8080) on Machine A for transfer engine metadata:
mooncake_http_metadata_server
Launch the Decoder instance on Machine A
Modify the vllm/examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh file.
diff --git a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
index 831ef0bb5..a2ff0744c 100644
--- a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
+++ b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
elif [[ $1 == "decoder" ]]; then
# Decoder listens on port 8200
- decode_config_file=$SCRIPT_DIR/configs/lmcache-decoder-config.yaml
+ decode_config_file=$SCRIPT_DIR/configs/mooncake-decoder-config.yaml
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=$decode_config_file \
LMCACHE_USE_EXPERIMENTAL=True \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=1 \
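The same one-line change can be made without a hand-written patch. The sketch below is a convenience only: the helper name `use_mooncake_config` is hypothetical (it is not part of vLLM), and it assumes the launcher references its config files exactly as shown in the diff above (`configs/lmcache-<role>-config.yaml`). It rewrites that one path for the given role and leaves the rest of the script untouched, so it also covers the equivalent prefiller edit.

```shell
#!/bin/sh
# Hypothetical helper (not part of vLLM): point the launcher at the
# Mooncake config for one role ("decoder" or "prefiller").
#   $1 = role, $2 = path to disagg_vllm_launcher.sh
use_mooncake_config() {
    role=$1
    launcher=$2
    tmp="${launcher}.tmp.$$"
    # Rewrite only the config path for this role; other lines pass through.
    if sed "s#configs/lmcache-${role}-config\.yaml#configs/mooncake-${role}-config.yaml#" \
        "$launcher" > "$tmp"; then
        # Copy the contents back so the launcher keeps its permissions.
        cat "$tmp" > "$launcher"
    fi
    rm -f "$tmp"
}

# Usage:
# use_mooncake_config decoder disagg_vllm_launcher.sh
```

Writing through a temporary file avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.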
Add the mooncake-decoder-config.yaml file:
chunk_size: 256
remote_url: "mooncakestore://{IP of Machine A}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100
extra_config:
  local_hostname: "{IP of Machine A}"
  metadata_server: "http://{IP of Machine A}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0" # Multiple RDMA devices can be specified as a comma-separated list
  master_server_address: "{IP of Machine A}:50052"
  global_segment_size: 32212254720 # 30GB
  local_buffer_size: 1073741824 # 1GB
  transfer_timeout: 1
  save_chunk_meta: False
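The two size values are 30 GiB and 1 GiB expressed in bytes. As a sketch, the helpers below compute those byte counts and write the config from two addresses, so the same function can produce both the decoder and the prefiller file. The function names, argument order, and example addresses are hypothetical. The keys and values are the ones listed above, and nesting the Mooncake keys under `extra_config` is assumed.

```shell
#!/bin/sh
# Convert GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Hypothetical helper: write a Mooncake-backed LMCache config.
#   $1 = IP of the machine this instance runs on (local_hostname)
#   $2 = IP of Machine A (Mooncake master and HTTP metadata server)
#   $3 = output path
write_mooncake_config() {
    local_ip=$1
    master_ip=$2
    out=$3
    cat > "$out" <<EOF
chunk_size: 256
remote_url: "mooncakestore://${master_ip}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100
extra_config:
  local_hostname: "${local_ip}"
  metadata_server: "http://${master_ip}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0"
  master_server_address: "${master_ip}:50052"
  global_segment_size: $(gib_to_bytes 30)
  local_buffer_size: $(gib_to_bytes 1)
  transfer_timeout: 1
  save_chunk_meta: False
EOF
}

# Usage (addresses are placeholders):
# write_mooncake_config 10.0.0.1 10.0.0.1 configs/mooncake-decoder-config.yaml
```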
Launch the Decoder instance with the following command:
bash disagg_vllm_launcher.sh decoder Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
Launch the Prefiller instance on Machine B
Modify the vllm/examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh file.
diff --git a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
index 831ef0bb5..9e5a3f044 100644
--- a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
+++ b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
@@ -18,12 +18,14 @@ fi
if [[ $1 == "prefiller" ]]; then
# Prefiller listens on port 8100
- prefill_config_file=$SCRIPT_DIR/configs/lmcache-prefiller-config.yaml
+ prefill_config_file=$SCRIPT_DIR/configs/mooncake-prefiller-config.yaml
Add the mooncake-prefiller-config.yaml file:
chunk_size: 256
remote_url: "mooncakestore://{IP of Machine A}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100
extra_config:
  local_hostname: "{IP of Machine B}"
  metadata_server: "http://{IP of Machine A}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0" # Multiple RDMA devices can be specified as a comma-separated list
  master_server_address: "{IP of Machine A}:50052"
  global_segment_size: 32212254720 # 30GB
  local_buffer_size: 1073741824 # 1GB
  transfer_timeout: 1
  save_chunk_meta: False
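Note that this file differs from the decoder config only in `local_hostname`: every other address still points at Machine A. When the two files are edited by hand, a quick check like the sketch below can catch an accidental mismatch. The helper name is hypothetical, and it assumes both files are available on the same machine.

```shell
#!/bin/sh
# Hypothetical check: succeed if the two configs are identical once the
# local_hostname line is ignored.
#   $1, $2 = paths to the decoder and prefiller config files
configs_match_except_hostname() {
    rest_a=$(grep -v 'local_hostname:' "$1")
    rest_b=$(grep -v 'local_hostname:' "$2")
    [ "$rest_a" = "$rest_b" ]
}

# Usage:
# configs_match_except_hostname mooncake-decoder-config.yaml \
#     mooncake-prefiller-config.yaml && echo "configs are consistent"
```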
Launch the Prefiller instance with the following command:
bash disagg_vllm_launcher.sh prefiller Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
Launch the disagg_proxy_server with the following command:
python3 disagg_proxy_server.py --host localhost --port 9000 --prefiller-host IP_of_Machine_B --prefiller-port 8100 --decoder-host IP_of_Machine_A --decoder-port 8200
You can now send requests to the disagg_proxy_server to test PD disaggregated serving.
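As a sketch of such a test, the snippet below builds an OpenAI-style completion request and posts it to the proxy. It assumes the proxy exposes a `/v1/completions` route and is reachable on `localhost:9000`, matching the launch command above. The helper names, prompt, and `max_tokens` value are illustrative, and the payload builder does not escape quotes in the prompt.

```shell
#!/bin/sh
# Build a minimal completion payload.
# No JSON escaping is done, so keep the prompt free of quotes and backslashes.
#   $1 = model name, $2 = prompt, $3 = max_tokens
build_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "$3"
}

# Post the payload to the proxy. The /v1/completions route is an assumption.
send_completion() {
    build_payload "$1" "$2" "$3" |
        curl -s -X POST "http://localhost:9000/v1/completions" \
            -H "Content-Type: application/json" \
            -d @-
}

# Usage:
# send_completion Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 "Tell me a story" 32
```

The model name should match the one passed to the decoder and prefiller launch commands.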