LMDeploy Disaggregated Serving with MooncakeTransferEngine#
Overview#
This document describes the MooncakeTransferEngine integration with the LMDeploy project, based on PR 3304 and PR 3620, which support KVCache transfer for intra-node and inter-node disaggregated serving scenarios.
Please note that this integration is still experimental and may change at any time based on feedback from the LMDeploy community.
Installation#
Prerequisites#
pip install mooncake-transfer-engine
Note: If any `.so` file is missing, uninstall the pip package with `pip3 uninstall mooncake-transfer-engine` and build the binaries manually from source following the build instructions.
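A quick way to check whether the installed wheel actually ships the shared libraries is to list the package contents with pip (the exact filenames vary by version, so treat this as a coarse check):

# List the files installed by the wheel and filter for shared libraries.
pip3 show -f mooncake-transfer-engine | grep '\.so'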
Install the latest version of LMDeploy#
1. Clone LMDeploy from the official repo#
git clone https://github.com/InternLM/lmdeploy.git
2. Build#
2.1 Build from source#
cd lmdeploy
pip install -e .
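After the build completes, a quick sanity check is to import the package and print its version; `lmdeploy check_env`, if present in your version, also dumps useful environment details:

# Verify that the editable install is importable.
python3 -c "import lmdeploy; print(lmdeploy.__version__)"
# Optionally collect environment information for bug reports.
lmdeploy check_env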
If you encounter any problems that you cannot solve, please refer to the LMDeploy official installation guide.
Configuration#
Running prefill and decode instances on different nodes#
Proxy:#
lmdeploy serve proxy \
--server-name 192.168.0.147 \
--server-port 8000 \
--routing-strategy "min_expected_latency" \
--serving-strategy DistServe \
--log-level INFO
- The `--server-name` and `--server-port` parameters specify the host and listening port of the LMDeploy proxy service.
- The `--routing-strategy` parameter determines how requests are routed to prefill/decode servers (choose from `min_expected_latency`, `min_observed_latency`, and `random`).
- The `--serving-strategy` parameter sets the serving mode; `DistServe` enables disaggregated serving, while the default `Hybrid` mode colocates prefill and decode.
- The `--log-level` parameter controls the verbosity of runtime logging.
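Before launching the workers, you can optionally confirm the proxy is reachable. The status endpoint below is an assumption based on the proxy's node-management API; adjust it if your LMDeploy version exposes a different route:

# Query the proxy for its registered prefill/decode nodes
# (the /nodes/status endpoint is assumed here).
curl http://192.168.0.147:8000/nodes/status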
Prefill:#
lmdeploy serve api_server Qwen/Qwen3-8B \
--server-name 192.168.0.101 \
--server-port 23333 \
--role Prefill \
--proxy-url http://192.168.0.147:8000 \
--backend pytorch \
--migration-backend Mooncake
- The `--role` parameter sets the node role in the disaggregated system (`Prefill` for token embedding and KV cache generation).
- The `--proxy-url` parameter connects the worker instance back to the proxy for coordination.
- The `--backend` parameter specifies the model execution backend (e.g., `pytorch`, `turbomind`).
- The `--migration-backend` parameter defines the KV cache transport mechanism (e.g., `Mooncake` and `DLSlime`).
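For models too large for a single GPU, a prefill (or decode) instance can span several GPUs with tensor parallelism. A minimal sketch using the standard `--tp` flag of `lmdeploy serve api_server`:

# Prefill instance sharded across 2 GPUs via tensor parallelism.
lmdeploy serve api_server Qwen/Qwen3-8B \
--server-name 192.168.0.101 \
--server-port 23333 \
--role Prefill \
--tp 2 \
--proxy-url http://192.168.0.147:8000 \
--backend pytorch \
--migration-backend Mooncake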
Decode:#
lmdeploy serve api_server Qwen/Qwen3-8B \
--server-name 192.168.0.147 \
--server-port 23334 \
--role Decode \
--proxy-url http://192.168.0.147:8000 \
--backend pytorch \
--migration-backend Mooncake
Test Inference:#
curl -X POST "http://192.168.0.147:8000/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"temperature": 0,
"prompt": "Shanghai is a city that ",
"max_tokens": 16,
"stream": false
}'
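To watch tokens arrive incrementally while prefill and decode run on different nodes, send the same request with streaming enabled:

curl -X POST "http://192.168.0.147:8000/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"temperature": 0,
"prompt": "Shanghai is a city that ",
"max_tokens": 16,
"stream": true
}'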
Running prefill and decode instances on the same node#
Proxy:#
lmdeploy serve proxy \
--server-name 192.168.0.147 \
--server-port 8000 \
--routing-strategy "min_expected_latency" \
--serving-strategy DistServe \
--log-level INFO
Prefill:#
CUDA_VISIBLE_DEVICES=0 \
lmdeploy serve api_server Qwen/Qwen3-8B \
--server-name 192.168.0.147 \
--server-port 23333 \
--role Prefill \
--proxy-url http://192.168.0.147:8000 \
--backend pytorch \
--migration-backend Mooncake
Setting `CUDA_VISIBLE_DEVICES=0` restricts the prefill service to a single GPU, since one GPU generally does not have enough VRAM to run both the prefill and decode processes.
Decode:#
CUDA_VISIBLE_DEVICES=1 \
lmdeploy serve api_server Qwen/Qwen3-8B \
--server-name 192.168.0.147 \
--server-port 23334 \
--role Decode \
--proxy-url http://192.168.0.147:8000 \
--backend pytorch \
--migration-backend Mooncake
Test Inference:#
curl -X POST "http://192.168.0.147:8000/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"temperature": 0,
"prompt": "Shanghai is a city that ",
"max_tokens": 16,
"stream": false
}'
Notes#
- You can run multiple prefill or decode instances, giving each a distinct `--server-port` and its own GPU via `CUDA_VISIBLE_DEVICES`; see the sketch after this list.
- MooncakeTransferEngine supports both intra-node (PCIe) and inter-node (RDMA/CXL) transfer; device selection is automatic or customizable via config.
- If a Hugging Face model times out while loading during prefill startup, consider pointing the model path at a local copy (e.g., `~/Qwen3-8B`) so the weights are loaded from local disk.
- Use `--log-level DEBUG` to get detailed runtime logs for troubleshooting.
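For example, an additional decode instance on the same node could look like this (the port and GPU index are illustrative):

# Second decode instance on its own GPU and port; it registers
# itself with the proxy via --proxy-url like the others.
CUDA_VISIBLE_DEVICES=2 \
lmdeploy serve api_server Qwen/Qwen3-8B \
--server-name 192.168.0.147 \
--server-port 23335 \
--role Decode \
--proxy-url http://192.168.0.147:8000 \
--backend pytorch \
--migration-backend Mooncake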