Disaggregated Prefill-Decode with MooncakeConnector#

Overview#

This guide shows how to use MooncakeConnector with vLLM for disaggregated Prefill-Decode (PD) serving. MooncakeConnector transfers the KV cache directly between prefill and decode instances on different nodes over RDMA, reaching a peak bandwidth of up to 142.25 GB/s (71.1% utilization of 8x RoCE).

For more details about Mooncake, see the Mooncake project and the Mooncake documentation.

Choose Your vLLM Backend#

Backend	vLLM Version	Status	Guide
vLLM V1	Latest	Recommended	Jump to V1 guide
vLLM V0	≤ v0.6.4.post1	Legacy	Jump to V0 guide

New Users

If you are starting a new deployment, use the vLLM V1 backend. V0 support is maintained for existing deployments only.

Using vLLM V1 (Recommended)#

This section covers MooncakeConnector integration with the vLLM V1 backend. The integration enables efficient cross-node KV cache transfer via RDMA.

Installation#

Prerequisites#

Install mooncake-transfer-engine through pip:

pip install mooncake-transfer-engine

Note

If you encounter problems such as a missing lib*.so, uninstall the package with pip3 uninstall mooncake-transfer-engine and build the binaries manually according to the build instructions.

Install vLLM#

Refer to the official vLLM installation guide for the latest installation instructions.

Usage#

Basic Setup (Different Nodes)#

Prefiller Node (192.168.0.2):

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8010 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

Decoder Node (192.168.0.3):

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8020 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'

Proxy Server:

# In vllm root directory.
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
  --prefiller-host 192.168.0.2 --prefiller-port 8010 \
  --decoder-host 192.168.0.3 --decoder-port 8020

NOTE: The Mooncake Connector currently uses the proxy from nixl_integration. This will be replaced with a self-developed proxy in the future.

Now you can send requests to the proxy server through port 8000.

Test:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Tell me a long story about artificial intelligence."}
    ]
  }'
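
The same test request can be sent from Python. The sketch below uses only the standard library and assumes the proxy listens on 127.0.0.1:8000, as in the curl example. The helper names `build_chat_request` and `send` are illustrative and not part of vLLM or Mooncake.

```python
import json
import urllib.request

# Assumed proxy address, taken from the curl example above.
PROXY_URL = "http://127.0.0.1:8000/v1/chat/completions"


def build_chat_request(prompt, model="Qwen/Qwen2.5-7B-Instruct", url=PROXY_URL):
    """Build (but do not send) a chat-completions POST request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request through the proxy and return the decoded JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the prefiller, decoder, and proxy running, `send(build_chat_request("Tell me a long story about artificial intelligence."))` should return the same JSON document that curl prints.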

Advanced Configuration#

With Tensor Parallelism:

Prefiller:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8010 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'

Decoder:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --port 8020 \
  --tensor-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'

Configuration Parameters#

--kv-transfer-config: JSON string to configure the KV transfer connector
- kv_connector: Set to "MooncakeConnector"
- kv_role: Role of the instance
  - kv_producer: For prefiller instances that generate KV caches
  - kv_consumer: For decoder instances that consume KV caches
  - kv_both: Enables symmetric functionality (experimental)
- num_workers: Size of the thread pool that each prefiller worker uses to send the KV cache (default: 10)
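
Writing the JSON string by hand is error-prone, so it can help to generate it. The following is a minimal sketch, assuming only the keys and roles listed above; `kv_transfer_config` and `serve_command` are hypothetical helpers, not part of the vLLM CLI or API. Extra keyword arguments are passed through into the JSON unchanged.

```python
import json
import shlex

# Roles listed in the parameter description above.
VALID_ROLES = {"kv_producer", "kv_consumer", "kv_both"}


def kv_transfer_config(role, **extra):
    """Return the JSON string to pass to --kv-transfer-config."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown kv_role: {role!r}")
    config = {"kv_connector": "MooncakeConnector", "kv_role": role, **extra}
    # Compact separators match the style used in the commands in this guide.
    return json.dumps(config, separators=(",", ":"))


def serve_command(model, port, role, **extra):
    """Assemble a shell-quoted `vllm serve` command line as a string."""
    args = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", kv_transfer_config(role, **extra),
    ]
    return " ".join(shlex.quote(arg) for arg in args)
```

For example, `print(serve_command("Qwen/Qwen2.5-7B-Instruct", 8010, "kv_producer"))` prints a command equivalent to the prefiller command in Basic Setup.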

Environment Variables#

VLLM_MOONCAKE_BOOTSTRAP_PORT: Port for Mooncake bootstrap server (default: 8998)
- Required only for prefiller instances
- Each vLLM worker needs a unique port on its host
- For TP/DP deployments, each worker’s port is computed as: base_port + dp_rank * tp_size + tp_rank
VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: Timeout (in seconds) for automatically releasing KV cache (default: 480)
- Used when a request is aborted to prevent holding resources indefinitely
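
The port formula above can be checked before launch to see which ports a deployment will occupy. This is a small sketch of that arithmetic only; the function names are illustrative, and it does not read vLLM's actual configuration.

```python
def bootstrap_port(base_port, dp_rank, tp_rank, tp_size):
    """Port for one worker: base_port + dp_rank * tp_size + tp_rank."""
    return base_port + dp_rank * tp_size + tp_rank


def bootstrap_ports(base_port=8998, dp_size=1, tp_size=1):
    """Map every (dp_rank, tp_rank) pair to its bootstrap port."""
    return {
        (dp_rank, tp_rank): bootstrap_port(base_port, dp_rank, tp_rank, tp_size)
        for dp_rank in range(dp_size)
        for tp_rank in range(tp_size)
    }
```

With the default base port, `dp_size=2`, and `tp_size=4`, the workers occupy ports 8998 through 9005, so that whole range must be free on the prefiller host.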

Performance#

For detailed performance benchmarks and results, see the vLLM PD Disaggregation Performance documentation.

Using vLLM V0 (Legacy)#

Legacy Backend

This section is for vLLM V0 backend (≤ v0.6.4.post1). For new deployments, use the V1 backend above.

This integration is based on PR 10502 and PR 10884.

Installation#

Prerequisite#

pip3 install mooncake-transfer-engine

Note

If you encounter problems such as a missing lib*.so, uninstall the package with pip3 uninstall mooncake-transfer-engine and build the binaries manually according to the build instructions.
vLLM versions ≤ v0.8.4 require mooncake-transfer-engine ≤ 0.3.3.post2. The mooncake_vllm_adaptor interface is deprecated in the latest release.

Install vLLM#

1. Clone vLLM from the official repo:

git clone git@github.com:vllm-project/vllm.git

2. Build from source (includes C++ and CUDA code):

cd vllm
pip3 uninstall vllm -y
pip3 install -e .

Tip

If the build fails, try upgrading cmake: pip3 install cmake --upgrade.

If you encounter any problems, refer to the vLLM official compilation guide.

Configuration#

Prepare configuration file over RDMA#

Create a mooncake.json file for both Prefill and Decode instances. Use the identical config file on both sides.

{
  "prefill_url": "192.168.0.137:13003",
  "decode_url": "192.168.0.139:13003",
  "metadata_server": "192.168.0.139:2379",
  "metadata_backend": "etcd",
  "protocol": "rdma",
  "device_name": "erdma_0"
}

prefill_url: The IP address and port of the Prefill node (the port is used to communicate with the metadata server).
decode_url: The IP address and port of the Decode node. If running prefill and decode on the same node, set a different port (at least 50 apart from prefill_url port) to avoid conflicts.
metadata_server: The metadata server address. Supports etcd, redis, and http backends. Example: "etcd://192.168.0.137:2379", "redis://192.168.0.137:6379", "http://192.168.0.137:8080/metadata".
metadata_backend: Currently supports "etcd", "redis", and "http". If absent and metadata_server has no prefix, defaults to "etcd". This parameter will be deprecated in a future version.
protocol: "rdma" or "tcp".
device_name: Required when protocol is "rdma". Multiple NICs can be separated by commas ("erdma_0,erdma_1").
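
Two of the rules above are easy to check before starting the servers: device_name is required for RDMA, and the metadata backend falls back to etcd when nothing else is specified. The sketch below encodes only those rules; the function names are hypothetical, and letting a URL prefix take precedence over metadata_backend is an assumption rather than documented Mooncake behavior.

```python
import json


def split_metadata_server(address, backend=None):
    """Return (backend, address) using the prefix rule described above."""
    if "://" in address:
        scheme, rest = address.split("://", 1)
        return scheme, rest  # assumption: an explicit prefix wins
    # No prefix: use metadata_backend if given, otherwise default to etcd.
    return backend or "etcd", address


def validate_mooncake_config(cfg):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    protocol = cfg.get("protocol")
    if protocol not in ("rdma", "tcp"):
        problems.append(f"protocol must be rdma or tcp, got {protocol!r}")
    if protocol == "rdma" and not cfg.get("device_name"):
        problems.append("device_name is required when protocol is rdma")
    return problems


def load_and_validate(path):
    """Read a mooncake.json file and return the problems found in it."""
    with open(path, encoding="utf-8") as handle:
        return validate_mooncake_config(json.load(handle))
```

Running `load_and_validate("./mooncake.json")` on both nodes before launch gives a quick sanity check. It does not verify that the RDMA device exists or that the metadata server is reachable.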

Prepare configuration file over TCP#

{
  "prefill_url": "192.168.0.137:13003",
  "decode_url": "192.168.0.139:13003",
  "metadata_server": "192.168.0.139:2379",
  "metadata_backend": "etcd",
  "protocol": "tcp",
  "device_name": ""
}

Run Example#

Change the IP addresses and ports according to your environment.

# Begin from root of your cloned repo!

# 1. Start the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379

# 2. Run on the prefilling side (producer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8100 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9}'

# 3. Run on the decoding side (consumer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --port 8200 \
    --max-model-len 10000 \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":2e9}'

Key parameters:

MOONCAKE_CONFIG_PATH: Path to the mooncake.json configuration file.
VLLM_USE_MODELSCOPE: Optional. Remove if you have HuggingFace access.
--kv-transfer-config: Connector configuration
- kv_connector: "MooncakeConnector"
- kv_role: "kv_producer" or "kv_consumer"
- kv_rank: 0 for producer, 1 for consumer
- kv_parallel_size: Fixed to 2 currently
- kv_buffer_size: KVCache lookup buffer size; increase for longer prompts. If OOM occurs, decrease --gpu-memory-utilization.
- kv_ip and kv_port: Used to specify the IP address and port of the master node for "PyNcclConnector" distributed setup. Not used for "MooncakeConnector" currently. Instead, "MooncakeConnector" uses a config file to set up the distributed connection.
--tensor-parallel-size / -tp: Supported. If running on the same node, set different CUDA_VISIBLE_DEVICES.

Note

If running prefill and decode on the same node, set a different port for decode_url. To avoid port conflicts, ensure the decode port differs by at least 50 from the prefill_url port (e.g., "decode_url": "192.168.0.137:13103"). If the same URL is set for both, the port of decode_url will be automatically incremented by 100.
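
The same-node port rule in this note can be expressed as a quick check. This is a sketch that models the rule as stated here (a minimum gap of 50, and an automatic increment of 100 for identical URLs); it is not taken from Mooncake's source, and the function names are illustrative.

```python
def split_url(url):
    """Split "host:port" into (host, port)."""
    host, port = url.rsplit(":", 1)
    return host, int(port)


def effective_decode_url(prefill_url, decode_url, auto_increment=100):
    """Apply the automatic +100 port shift when both URLs are identical."""
    if prefill_url == decode_url:
        host, port = split_url(decode_url)
        return f"{host}:{port + auto_increment}"
    return decode_url


def ports_conflict(prefill_url, decode_url, min_gap=50):
    """True if both instances share a host and their ports are too close."""
    p_host, p_port = split_url(prefill_url)
    d_host, d_port = split_url(effective_decode_url(prefill_url, decode_url))
    return p_host == d_host and abs(d_port - p_port) < min_gap
```

For instance, `ports_conflict("192.168.0.137:13003", "192.168.0.137:13010")` reports a conflict, while a decode port of 13103 on the same host does not.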

Proxy Server:

python3 proxy_server.py

# proxy_server.py
import os
import traceback

import aiohttp
from quart import Quart, make_response, request

# Allow long-running generations (up to 6 hours) before the client times out.
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)

PREFILL_URL = 'http://localhost:8100/v1/completions'
DECODE_URL = 'http://192.168.0.139:8200/v1/completions'  # Change IP

app = Quart(__name__)


async def forward_request(url, data):
    """POST `data` to `url` and stream the response body back in chunks."""
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        headers = {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"}
        async with session.post(url=url, json=data, headers=headers) as response:
            # Non-200 responses yield nothing, so the caller sees an empty body.
            if response.status == 200:
                async for chunk_bytes in response.content.iter_chunked(1024):
                    yield chunk_bytes


@app.route('/v1/completions', methods=['POST'])
async def handle_request():
    try:
        original_request_data = await request.get_json()

        # Step 1: run prefill only by capping generation at one token.
        prefill_request = original_request_data.copy()
        prefill_request['max_tokens'] = 1
        async for _ in forward_request(PREFILL_URL, prefill_request):
            continue

        # Step 2: send the original request to the decode instance and stream the result.
        generator = forward_request(DECODE_URL, original_request_data)
        response = await make_response(generator)
        response.timeout = None
        return response
    except Exception:
        print("Error occurred in disagg prefill proxy server")
        print(traceback.format_exc())
        # Return an explicit error; a view that returns None is rejected by Quart.
        return "Internal error in disagg prefill proxy server", 500


if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000)

Be sure to change the IP address in the proxy server code.

Test#

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "prompt": "San Francisco is a",
    "max_tokens": 1000
  }'

Troubleshooting#

If you encounter connection issues, check that:
- All nodes can reach each other over the network
- Firewall rules allow traffic on the specified ports
- RDMA devices are properly configured and listed in device_name

Additional tips:
- For missing library errors, rebuild mooncake-transfer-engine from source
- Enable debug logging with VLLM_LOGGING_LEVEL=DEBUG for detailed diagnostics
- For production deployments, consider using a more robust proxy solution
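
The first two connection checks can be partly automated. The sketch below only tests whether a TCP connection to a host and port succeeds, using the standard library; it says nothing about RDMA device health, and the function names are illustrative.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Check several (host, port) pairs and report each result."""
    return {
        f"{host}:{port}": port_is_reachable(host, port, timeout)
        for host, port in endpoints
    }
```

For the V1 setup in this guide, `check_endpoints([("192.168.0.2", 8010), ("192.168.0.3", 8020)])` run from the proxy host shows whether both vLLM servers accept connections.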
