Disaggregated Prefill-Decode with MooncakeConnector#
Overview#
This guide demonstrates how to use MooncakeConnector with vLLM for disaggregated Prefill-Decode (PD) serving. MooncakeConnector enables direct cross-node KV cache transfer between prefill and decode instances using RDMA technology, achieving up to 142.25 GB/s peak bandwidth (71.1% utilization of 8x RoCE).
For more details about Mooncake, please refer to Mooncake project and Mooncake documents.
Choose Your vLLM Backend#
Backend |
vLLM Version |
Status |
Guide |
|---|---|---|---|
vLLM V1 |
Latest |
Recommended |
|
vLLM V0 |
≤ v0.6.4.post1 |
Legacy |
New Users
If you are starting a new deployment, use the vLLM V1 backend. V0 support is maintained for existing deployments only.
Using vLLM V1 (Recommended)#
This section covers MooncakeConnector integration with vLLM V1 backend. The integration enables efficient cross-node KV cache transfer via RDMA.
Installation#
Prerequisites#
Install mooncake-transfer-engine through pip:
pip install mooncake-transfer-engine
Note
If you encounter problems such as missing lib*.so, uninstall this package by pip3 uninstall mooncake-transfer-engine, and build the binaries manually according to the instructions.
Install vLLM#
Refer to vLLM official installation guide for the latest installation instructions.
Usage#
Basic Setup (Different Nodes)#
Prefiller Node (192.168.0.2):
vllm serve Qwen/Qwen2.5-7B-Instruct \
--port 8010 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
Decoder Node (192.168.0.3):
vllm serve Qwen/Qwen2.5-7B-Instruct \
--port 8020 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
Proxy Server:
# In vllm root directory.
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
--prefiller-host 192.168.0.2 --prefiller-port 8010 \
--decoder-host 192.168.0.3 --decoder-port 8020
NOTE: The Mooncake Connector currently uses the proxy from nixl_integration. This will be replaced with a self-developed proxy in the future.
Now you can send requests to the proxy server through port 8000.
Test:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Tell me a long story about artificial intelligence."}
]
}'
Advanced Configuration#
With Tensor Parallelism:
Prefiller:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
--port 8010 \
--tensor-parallel-size 8 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
Decoder:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
--port 8020 \
--tensor-parallel-size 8 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
Configuration Parameters#
--kv-transfer-config: JSON string to configure the KV transfer connectorkv_connector: Set to"MooncakeConnector"kv_role: Role of the instancekv_producer: For prefiller instances that generate KV cacheskv_consumer: For decoder instances that consume KV cacheskv_both: Enables symmetric functionality (experimental)
num_workers: Thread pool size in each prefiller worker to send kvcache (default 10)
Environment Variables#
VLLM_MOONCAKE_BOOTSTRAP_PORT: Port for Mooncake bootstrap server (default: 8998)Required only for prefiller instances
Each vLLM worker needs a unique port on its host
For TP/DP deployments, each worker’s port is computed as:
base_port + dp_rank * tp_size + tp_rank
VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: Timeout (in seconds) for automatically releasing KV cache (default: 480)Used when a request is aborted to prevent holding resources indefinitely
Performance#
For detailed performance benchmarks and results, see the vLLM Benchmark documentation.
Using vLLM V0 (Legacy)#
Legacy Backend
This section is for vLLM V0 backend (≤ v0.6.4.post1). For new deployments, use the V1 backend above.
This integration is based on PR 10502 and PR 10884. Preview benchmark results are available at vLLM Benchmark Results V0.2.
Installation#
Prerequisite#
pip3 install mooncake-transfer-engine
Note
If you encounter problems such as missing
lib*.so, uninstall this package bypip3 uninstall mooncake-transfer-engine, and build the binaries manually according to the instructions.For vLLM version ≤ v0.8.4, it requires
mooncake-transfer-engine ≤ 0.3.3.post2. In the latest release, the interfacemooncake_vllm_adaptorhas been deprecated.
Install vLLM#
1. Clone vLLM from official repo:
git clone git@github.com:vllm-project/vllm.git
2. Build from source (Include C++ and CUDA code):
cd vllm
pip3 uninstall vllm -y
pip3 install -e .
Tip
If the build fails, try upgrading cmake: pip3 install cmake --upgrade.
If you encounter any problems, refer to the vLLM official compilation guide.
Configuration#
Prepare configuration file over RDMA#
Create a mooncake.json file for both Prefill and Decode instances. Use the identical config file on both sides.
{
"prefill_url": "192.168.0.137:13003",
"decode_url": "192.168.0.139:13003",
"metadata_server": "192.168.0.139:2379",
"metadata_backend": "etcd",
"protocol": "rdma",
"device_name": "erdma_0"
}
prefill_url: The IP address and port of the Prefill node (port is used to communicate with metadata server).decode_url: The IP address and port of the Decode node. If running prefill and decode on the same node, set a different port (at least 50 apart fromprefill_urlport) to avoid conflicts.metadata_server: The metadata server address. Supportsetcd,redis, andhttpbackends. Example:"etcd://192.168.0.137:2379","redis://192.168.0.137:6379","http://192.168.0.137:8080/metadata".metadata_backend: Currently supports"etcd","redis", and"http". If absent andmetadata_serverhas no prefix, defaults to"etcd". This parameter will be deprecated in a future version.protocol:"rdma"or"tcp".device_name: Required when protocol is"rdma". Multiple NICs can be separated by commas ("erdma_0,erdma_1").
Prepare configuration file over TCP#
{
"prefill_url": "192.168.0.137:13003",
"decode_url": "192.168.0.139:13003",
"metadata_server": "192.168.0.139:2379",
"metadata_backend": "etcd",
"protocol": "tcp",
"device_name": ""
}
Run Example#
Change the IP addresses and ports according to your environment.
# Begin from root of your cloned repo!
# 1. Start the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379
# 2. Run on the prefilling side (producer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--port 8100 \
--max-model-len 10000 \
--gpu-memory-utilization 0.8 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9}'
# 3. Run on the decoding side (consumer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--port 8200 \
--max-model-len 10000 \
--gpu-memory-utilization 0.8 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":2e9}'
Key parameters:
MOONCAKE_CONFIG_PATH: Path to the mooncake.json configuration file.VLLM_USE_MODELSCOPE: Optional. Remove if you have HuggingFace access.--kv-transfer-config: Connector configurationkv_connector:"MooncakeConnector"kv_role:"kv_producer"or"kv_consumer"kv_rank: 0 for producer, 1 for consumerkv_parallel_size: Fixed to 2 currentlykv_buffer_size: KVCache lookup buffer size; increase for longer prompts. If OOM occurs, decrease--gpu-memory-utilization.kv_ipandkv_port: Used to specify the IP address and port of the master node for"PyNcclConnector"distributed setup. Not used for"MooncakeConnector"currently. Instead,"MooncakeConnector"uses a config file to set up the distributed connection.
--tensor-parallel-size/-tp: Supported. If running on the same node, set differentCUDA_VISIBLE_DEVICES.
Note
If running prefill and decode on the same node, set a different port for decode_url. To avoid port conflicts, ensure the decode port differs by at least 50 from the prefill_url port (e.g., "decode_url": "192.168.0.137:13103"). If the same URL is set for both, the port of decode_url will be automatically incremented by 100.
Proxy Server:
python3 proxy_server.py
# proxy_server.py
import os
import aiohttp
from quart import Quart, make_response, request
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)
app = Quart(__name__)
async def forward_request(url, data):
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
headers = {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"}
async with session.post(url=url, json=data, headers=headers) as response:
if response.status == 200:
async for chunk_bytes in response.content.iter_chunked(1024):
yield chunk_bytes
@app.route('/v1/completions', methods=['POST'])
async def handle_request():
try:
original_request_data = await request.get_json()
prefill_request = original_request_data.copy()
prefill_request['max_tokens'] = 1 # prefill only
async for _ in forward_request('http://localhost:8100/v1/completions', prefill_request):
continue
generator = forward_request('http://192.168.0.139:8200/v1/completions', # Change IP
original_request_data)
response = await make_response(generator)
response.timeout = None
return response
except Exception as e:
import sys, traceback
exc_info = sys.exc_info()
print("Error occurred in disagg prefill proxy server")
print(e)
print("".join(traceback.format_exception(*exc_info)))
if __name__ == '__main__':
app.run(host="0.0.0.0", port=8000)
Be sure to change the IP address in the proxy server code.
Test#
curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
"prompt": "San Francisco is a",
"max_tokens": 1000
}'
Troubleshooting#
If you encounter connection issues, check that:
All nodes can reach each other over the network
Firewall rules allow traffic on the specified ports
RDMA devices are properly configured and listed in
device_name
For missing library errors, rebuild
mooncake-transfer-enginefrom sourceEnable debug logging with
VLLM_LOGGING_LEVEL=DEBUGfor detailed diagnosticsFor production deployments, consider using a more robust proxy solution