vLLM V1 Disaggregated Serving with Mooncake Store and LMCache [MP]#
Overview#
This guide shows a single-machine 1-prefill/1-decode deployment using vLLM V1, LMCache’s multiprocess server, and Mooncake Store as the LMCache L2 backend.
LMCache supports both non-MP mode and MP mode with Mooncake Store. This page
covers the MP path, where vLLM instances connect to an LMCache server through
LMCacheMPConnector, and the LMCache server connects to Mooncake Store through
the mooncake_store L2 adapter. For the non-MP LMCacheConnectorV1 path, see
vLLM V1 Disaggregated Serving with Mooncake Store and LMCache.
In this setup, one machine runs Mooncake master, one LMCache MP server, the disaggregated proxy, the prefiller vLLM instance, and the decoder vLLM instance. The prefiller and decoder should use different GPUs.
This example uses "metadata_server":"P2PHANDSHAKE" for Mooncake transfer
metadata, so the Mooncake HTTP metadata server is not needed. If you switch to
HTTP metadata, remember that Mooncake’s HTTP metadata server also defaults to
8080, which conflicts with LMCache’s HTTP API on a single host.
Prerequisites#
Install Mooncake, vLLM, and LMCache on the machine. The example assumes an RDMA deployment and uses:
local host address:
{IP of Machine}RDMA device:
{RDMA device}LMCache checkout path:
/path/to/LMCache
Replace these values with the local hostname/IP, RDMA device, and LMCache checkout path for your environment.
LMCache requirement: This example requires LMCache v0.4.5 or later. LMCache
must also be built from source with Mooncake support enabled, because the
mooncake_store MP L2 adapter depends on the optional
lmcache.lmcache_mooncake C++ extension.
The standard prebuilt LMCache wheels currently include the Python Mooncake
adapter files, but do not include the optional lmcache.lmcache_mooncake
native extension.
BUILD_MOONCAKE=1 \
MOONCAKE_INCLUDE_DIR=/path/to/mooncake/include \
MOONCAKE_LIB_DIR=/path/to/mooncake/lib \
pip install -e /path/to/LMCache --verbose
Deployment#
1. Start Mooncake Master#
mooncake_master -v=1 \
--rpc_port=50051 \
--metrics_port=9003
2. Start the LMCache Multiprocess Server#
Start one LMCache MP server and configure Mooncake Store as the L2 adapter.
lmcache server \
--host 127.0.0.1 \
--port 5555 \
--http-host 127.0.0.1 \
--http-port 8080 \
--l1-size-gb 32 \
--eviction-policy LRU \
--no-l1-use-lazy \
--l2-adapter '{
"type": "mooncake_store",
"local_hostname": "{IP of Machine}",
"metadata_server": "P2PHANDSHAKE",
"protocol": "rdma",
"rdma_devices": "{RDMA device}",
"global_segment_size": 32212254720,
"local_buffer_size": 1073741824,
"master_server_addr": "127.0.0.1:50051"
}'
3. Start the Disaggregated Proxy#
The proxy receives client requests, sends prefill requests to the prefiller, sends decode requests to the decoder, and receives LMCache request telemetry from the prefiller.
python /path/to/LMCache/examples/disagg_prefill_mp/disagg_proxy_server.py \
--host 127.0.0.1 \
--port 8000 \
--prefiller-host 127.0.0.1 \
--prefiller-port 8100 \
--decoder-host 127.0.0.1 \
--decoder-port 8200 \
--telemetry-port 5768
4. Start the vLLM Prefiller#
The prefiller reports request telemetry back to the proxy so the proxy knows when KV cache storage has completed.
CUDA_VISIBLE_DEVICES=0 \
LMCACHE_REQUEST_TELEMETRY_TYPE=fastapi \
LMCACHE_REQUEST_TELEMETRY_ENDPOINT=http://127.0.0.1:5768/api/v1/telemetry \
vllm serve Qwen/Qwen3-4B \
--host 127.0.0.1 \
--port 8100 \
--gpu-memory-utilization 0.8 \
--no-enable-log-requests \
--no-enable-prefix-caching \
--kv-transfer-config '{
"kv_connector": "LMCacheMPConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"lmcache.mp.host": "tcp://127.0.0.1",
"lmcache.mp.port": 5555
}
}'
5. Start the vLLM Decoder#
The decoder connects to the same local LMCache MP server. It does not need request telemetry environment variables; only the prefiller reports the “KV cache is stored” event back to the proxy.
CUDA_VISIBLE_DEVICES=1 \
vllm serve Qwen/Qwen3-4B \
--host 127.0.0.1 \
--port 8200 \
--gpu-memory-utilization 0.8 \
--no-enable-log-requests \
--no-enable-prefix-caching \
--kv-transfer-config '{
"kv_connector": "LMCacheMPConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"lmcache.mp.host": "tcp://127.0.0.1",
"lmcache.mp.port": 5555
}
}'
6. Send a Test Request#
Send traffic to the proxy, not directly to either vLLM instance.
curl -N http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-4B",
"messages": [
{
"role": "user",
"content": "Explain how KV cache reuse helps long-context serving."
}
],
"max_tokens": 128,
"temperature": 0.7
}'
Port and Configuration Checklist#
When changing ports away from these defaults, update all dependent settings together:
Mooncake master
--rpc_portmust match LMCachemaster_server_addr.The vLLM prefiller and decoder should connect to the local LMCache MP server via
kv_connector_extra_config.lmcache.mp.hostandkv_connector_extra_config.lmcache.mp.port.Proxy
--prefiller-portand--decoder-portmust match the two vLLM--portvalues.LMCACHE_REQUEST_TELEMETRY_ENDPOINTon the prefiller must point to the proxy telemetry endpoint.If
metadata_serveris changed fromP2PHANDSHAKEto an HTTP metadata URL, enable Mooncake HTTP metadata server and make sure its port does not conflict with LMCache--http-port.