Disaggregated Prefill-Decode with MooncakeConnector

Overview

This guide demonstrates how to use MooncakeConnector with vLLM for disaggregated Prefill-Decode (PD) serving. MooncakeConnector enables direct cross-node KV cache transfer between prefill and decode instances over RDMA, reaching a peak bandwidth of 142.25 GB/s (71.1% utilization of 8x RoCE).
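
As a quick sanity check on the bandwidth figures, the sketch below back-computes the aggregate capacity implied by the two quoted numbers. It uses only the 142.25 GB/s and 71.1% values from the text; the per-link figure additionally assumes the eight links are identical, which the text does not state.

```python
# Back-of-the-envelope check using only the two figures quoted above.
peak_gb_per_s = 142.25   # peak bandwidth from the text
utilization = 0.711      # 71.1% utilization from the text
num_links = 8            # "8x RoCE"

implied_capacity = peak_gb_per_s / utilization      # aggregate capacity in GB/s
per_link_gb_per_s = implied_capacity / num_links    # assumes identical links
per_link_gbit_per_s = per_link_gb_per_s * 8         # bytes -> bits

print(f"implied aggregate capacity: {implied_capacity:.1f} GB/s")
print(f"implied per-link rate: {per_link_gb_per_s:.1f} GB/s "
      f"(~{per_link_gbit_per_s:.0f} Gbit/s)")
```

Under that assumption the numbers work out to roughly 200 GB/s in aggregate, or about 25 GB/s (200 Gbit/s) per link.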

For more details about Mooncake, refer to the Mooncake project and the Mooncake documentation.


Choose Your vLLM Backend

| Backend | vLLM Version   | Status      | Guide            |
|---------|----------------|-------------|------------------|
| vLLM V1 | Latest         | Recommended | Jump to V1 guide |
| vLLM V0 | ≤ v0.6.4.post1 | Legacy      | Jump to V0 guide |

New Users

If you are starting a new deployment, use the vLLM V1 backend. V0 support is maintained for existing deployments only.



Using vLLM V0 (Legacy)

Legacy Backend

This section covers the vLLM V0 backend (≤ v0.6.4.post1). For new deployments, use the V1 backend above.

This integration is based on PR 10502 and PR 10884. Preview benchmark results are available at vLLM Benchmark Results V0.2.

Installation

Prerequisite

pip3 install mooncake-transfer-engine

Note

  • If you encounter problems such as missing lib*.so files, uninstall this package with pip3 uninstall mooncake-transfer-engine and build the binaries manually according to the instructions.

  • vLLM versions ≤ v0.8.4 require mooncake-transfer-engine 0.3.3.post2; in the latest release, the mooncake_vllm_adaptor interface has been deprecated.
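
To see which versions are actually installed before debugging a mismatch, a small check along the following lines may help. It is only a sketch: it uses the standard-library importlib.metadata module, the helper name installed_version is made up for this example, and it reports versions without trying to decide whether they are compatible.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Version named in the note above for vLLM <= v0.8.4.
REQUIRED_FOR_OLD_VLLM = "0.3.3.post2"

engine = installed_version("mooncake-transfer-engine")
vllm = installed_version("vllm")
print(f"vllm: {vllm}, mooncake-transfer-engine: {engine}")
if engine is not None and engine != REQUIRED_FOR_OLD_VLLM:
    print(f"Reminder: vLLM <= v0.8.4 expects mooncake-transfer-engine "
          f"{REQUIRED_FOR_OLD_VLLM}")
```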

Install vLLM

1. Clone vLLM from the official repo:

git clone git@github.com:vllm-project/vllm.git

2. Build from source (this compiles the C++ and CUDA code):

cd vllm
pip3 uninstall vllm -y
pip3 install -e .

Tip

If the build fails, try upgrading cmake: pip3 install cmake --upgrade.

If you encounter any problems, refer to the vLLM official compilation guide.

Configuration

Prepare configuration file over RDMA

Create a mooncake.json file for both Prefill and Decode instances. Use the identical config file on both sides.

{
  "prefill_url": "192.168.0.137:13003",
  "decode_url": "192.168.0.139:13003",
  "metadata_server": "192.168.0.139:2379",
  "metadata_backend": "etcd",
  "protocol": "rdma",
  "device_name": "erdma_0"
}

  • prefill_url: The IP address and port of the Prefill node (the port is used to communicate with the metadata server).

  • decode_url: The IP address and port of the Decode node. If running prefill and decode on the same node, set a different port (at least 50 apart from prefill_url port) to avoid conflicts.

  • metadata_server: The metadata server address. Supports etcd, redis, and http backends. Examples: "etcd://192.168.0.137:2379", "redis://192.168.0.137:6379", "http://192.168.0.137:8080/metadata".

  • metadata_backend: Currently supports "etcd", "redis", and "http". If absent and metadata_server has no prefix, defaults to "etcd". This parameter will be deprecated in a future version.

  • protocol: "rdma" or "tcp".

  • device_name: Required when protocol is "rdma". Multiple NICs can be separated by commas ("erdma_0,erdma_1").
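
The interaction between metadata_server and metadata_backend described above can be sketched as follows. This illustrates the documented rule and is not Mooncake's own parsing code: the function name is hypothetical, and letting a URL prefix take precedence over metadata_backend is an assumption, since the text does not say which one wins when both are present.

```python
SUPPORTED_BACKENDS = ("etcd", "redis", "http")


def resolve_metadata_backend(config: dict) -> str:
    """Pick the metadata backend implied by a mooncake.json-style dict."""
    server = config["metadata_server"]
    if "://" in server:
        # A prefix such as "redis://" names the backend directly
        # (assumed here to take precedence over metadata_backend).
        backend = server.split("://", 1)[0]
    else:
        # No prefix: fall back to metadata_backend, defaulting to "etcd".
        backend = config.get("metadata_backend", "etcd")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported metadata backend: {backend!r}")
    return backend


print(resolve_metadata_backend({"metadata_server": "192.168.0.139:2379"}))
print(resolve_metadata_backend({"metadata_server": "redis://192.168.0.137:6379"}))
```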

Prepare configuration file over TCP

{
  "prefill_url": "192.168.0.137:13003",
  "decode_url": "192.168.0.139:13003",
  "metadata_server": "192.168.0.139:2379",
  "metadata_backend": "etcd",
  "protocol": "tcp",
  "device_name": ""
}
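
Since the RDMA and TCP variants share the same keys, it can be convenient to generate the file from a script and catch obvious mistakes before starting vLLM. The sketch below is one possible way to do that; the helper names are invented for this example, and the checks cover only the rules stated in this guide (the required keys, the two protocol values, and device_name being mandatory for RDMA).

```python
import json

REQUIRED_KEYS = ("prefill_url", "decode_url", "metadata_server", "protocol")


def check_mooncake_config(config: dict) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    protocol = config.get("protocol")
    if protocol is not None and protocol not in ("rdma", "tcp"):
        problems.append(f"protocol must be 'rdma' or 'tcp', got {protocol!r}")
    # device_name is only required for RDMA; the TCP example leaves it empty.
    if protocol == "rdma" and not config.get("device_name"):
        problems.append("device_name is required when protocol is 'rdma'")
    return problems


def write_mooncake_config(config: dict, path: str = "mooncake.json") -> None:
    """Write the config to disk, refusing to write one that fails the checks."""
    problems = check_mooncake_config(config)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)


tcp_config = {
    "prefill_url": "192.168.0.137:13003",
    "decode_url": "192.168.0.139:13003",
    "metadata_server": "192.168.0.139:2379",
    "metadata_backend": "etcd",
    "protocol": "tcp",
    "device_name": "",
}
print(check_mooncake_config(tcp_config))  # prints []
```

Calling write_mooncake_config(tcp_config) would then produce the file shown above; switching to RDMA only requires changing protocol and filling in device_name.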

Run Example

Change the IP addresses and ports according to your environment.

# Begin from root of your cloned repo!

# 1. Start the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379

# 2. Run on the prefilling side (producer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8100 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9}'

# 3. Run on the decoding side (consumer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8200 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":2e9}'

Key parameters:

  • MOONCAKE_CONFIG_PATH: Path to the mooncake.json configuration file.

  • VLLM_USE_MODELSCOPE: Optional. Remove if you have HuggingFace access.

  • --kv-transfer-config: Connector configuration

    • kv_connector: "MooncakeConnector"

    • kv_role: "kv_producer" or "kv_consumer"

    • kv_rank: 0 for producer, 1 for consumer

    • kv_parallel_size: Fixed to 2 currently

    • kv_buffer_size: KVCache lookup buffer size; increase for longer prompts. If OOM occurs, decrease --gpu-memory-utilization.

    • kv_ip and kv_port: Used to specify the IP address and port of the master node for "PyNcclConnector" distributed setup. Not used for "MooncakeConnector" currently. Instead, "MooncakeConnector" uses a config file to set up the distributed connection.

  • --tensor-parallel-size / -tp: Supported. If running on the same node, set different CUDA_VISIBLE_DEVICES.
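
Because the producer and consumer values for --kv-transfer-config differ only in kv_role and kv_rank, the JSON string can be generated rather than typed by hand. The following is a minimal sketch using only the fields listed above; the function name is hypothetical, and the default buffer size simply mirrors the 2e9 used in the example commands.

```python
import json
import shlex

# kv_rank is tied to the role: 0 for the producer, 1 for the consumer.
RANK_FOR_ROLE = {"kv_producer": 0, "kv_consumer": 1}


def kv_transfer_config(role: str, buffer_size: float = 2e9) -> str:
    """Build the JSON value passed to --kv-transfer-config for one role."""
    if role not in RANK_FOR_ROLE:
        raise ValueError(f"kv_role must be one of {sorted(RANK_FOR_ROLE)}")
    return json.dumps(
        {
            "kv_connector": "MooncakeConnector",
            "kv_role": role,
            "kv_rank": RANK_FOR_ROLE[role],
            "kv_parallel_size": 2,  # fixed to 2 currently
            "kv_buffer_size": buffer_size,
        },
        separators=(",", ":"),
    )


for role in ("kv_producer", "kv_consumer"):
    # shlex.quote makes the JSON safe to paste into a shell command line.
    print("--kv-transfer-config", shlex.quote(kv_transfer_config(role)))
```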

Note

If running prefill and decode on the same node, set a different port for decode_url. To avoid port conflicts, ensure the decode port differs by at least 50 from the prefill_url port (e.g., "decode_url": "192.168.0.137:13103"). If the same URL is set for both, the port of decode_url will be automatically incremented by 100.
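
The same-node port rule can be expressed as a small pre-flight check. The sketch below only mirrors the behaviour described in the note (a gap of at least 50, and an automatic +100 when both URLs are identical); it is not Mooncake's implementation, and the function name is made up for illustration.

```python
def effective_decode_url(prefill_url: str, decode_url: str,
                         min_gap: int = 50, auto_increment: int = 100) -> str:
    """Return the decode URL that would be used under the rule in the note."""
    p_host, p_port = prefill_url.rsplit(":", 1)
    d_host, d_port = decode_url.rsplit(":", 1)
    if prefill_url == decode_url:
        # Identical URLs: the decode port is bumped automatically.
        return f"{d_host}:{int(d_port) + auto_increment}"
    if p_host == d_host and abs(int(p_port) - int(d_port)) < min_gap:
        raise ValueError(
            f"ports {p_port} and {d_port} on {p_host} are less than "
            f"{min_gap} apart"
        )
    return decode_url


print(effective_decode_url("192.168.0.137:13003", "192.168.0.137:13003"))
# prints 192.168.0.137:13103
```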

Proxy Server:

python3 proxy_server.py

# proxy_server.py
import os
import aiohttp
from quart import Quart, make_response, request

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)
app = Quart(__name__)

async def forward_request(url, data):
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        headers = {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"}
        async with session.post(url=url, json=data, headers=headers) as response:
            if response.status == 200:
                async for chunk_bytes in response.content.iter_chunked(1024):
                    yield chunk_bytes

@app.route('/v1/completions', methods=['POST'])
async def handle_request():
    try:
        original_request_data = await request.get_json()
        prefill_request = original_request_data.copy()
        prefill_request['max_tokens'] = 1  # prefill only
        async for _ in forward_request('http://localhost:8100/v1/completions', prefill_request):
            continue
        generator = forward_request('http://192.168.0.139:8200/v1/completions',  # Change IP
                                    original_request_data)
        response = await make_response(generator)
        response.timeout = None
        return response
    except Exception as e:
        import traceback
        print("Error occurred in disagg prefill proxy server")
        print(e)
        traceback.print_exc()
        # Return an explicit error response; a view that returns None makes Quart raise.
        return "Error occurred in disagg prefill proxy server", 500

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000)

Be sure to change the IP address in the proxy server code.
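
One way to avoid editing the source for every environment is to read the two endpoints from environment variables. The sketch below shows the idea together with the prefill-only request derivation used by the proxy. The variable names PREFILL_URL and DECODE_URL and the helper make_prefill_request are assumptions made for this example, not something vLLM or Mooncake defines; the defaults are the addresses used in the proxy code above.

```python
import copy
import os

# Hypothetical environment variable names, chosen for this sketch only.
PREFILL_URL = os.environ.get(
    "PREFILL_URL", "http://localhost:8100/v1/completions")
DECODE_URL = os.environ.get(
    "DECODE_URL", "http://192.168.0.139:8200/v1/completions")


def make_prefill_request(original: dict) -> dict:
    """Copy the client request and cap generation at one token (prefill only)."""
    prefill = copy.deepcopy(original)  # leave the client's request untouched
    prefill["max_tokens"] = 1
    return prefill


client_request = {"model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
                  "prompt": "San Francisco is a", "max_tokens": 1000}
print(make_prefill_request(client_request)["max_tokens"])  # prints 1
print(client_request["max_tokens"])                        # prints 1000
```

In handle_request, the two hard-coded URLs could then be replaced by PREFILL_URL and DECODE_URL.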

Test

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "prompt": "San Francisco is a",
    "max_tokens": 1000
  }'
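
The same request can be issued from Python, which is handy for scripted smoke tests. This is a sketch using only the standard library; the helper names are illustrative, and reading choices[0].text assumes the proxy relays a non-streaming OpenAI-style completion response.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(req: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["choices"][0]["text"]


req = build_completion_request(
    "http://localhost:8000",
    "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "San Francisco is a",
    1000,
)
# With the proxy and both vLLM instances running:
#     print(send_completion(req))
```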

Troubleshooting

  • If you encounter connection issues, check that:

    • All nodes can reach each other over the network

    • Firewall rules allow traffic on the specified ports

    • RDMA devices are properly configured and listed in device_name

  • For missing library errors, rebuild mooncake-transfer-engine from source

  • Enable debug logging with VLLM_LOGGING_LEVEL=DEBUG for detailed diagnostics

  • For production deployments, consider using a more robust proxy solution
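
For the connectivity checks listed above, a quick TCP probe can rule out basic network and firewall problems. The sketch below uses the standard socket module; the helper name is made up, and it only tests plain TCP reachability, so it says nothing about whether the RDMA devices themselves are configured correctly.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example targets taken from the sample configuration; adjust to your setup:
#     can_connect("192.168.0.139", 2379)   # metadata server (etcd)
#     can_connect("192.168.0.139", 8200)   # decode instance API
#     can_connect("localhost", 8100)       # prefill instance API
```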