SGLang Disaggregated Serving with MooncakeTransferEngine#
Overview#
This is the latest version of the MooncakeTransferEngine integration documentation for the SGLang project, based on PR 4654 and PR 4880, which adds KVCache transfer support for intra-node and inter-node disaggregated serving scenarios.
Please note that this is still an experimental version and may be modified at any time based on feedback from the SGLang community.
Installation#
Prerequisite#
pip3 install mooncake-transfer-engine
Note: If you encounter problems such as a missing lib*.so, uninstall the package with pip3 uninstall mooncake-transfer-engine and build the binaries manually according to the instructions.
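A quick sanity check after installation; pip3 show only confirms the wheel is present, and the Python import path on the second line is an assumption based on the Mooncake Python bindings, so adjust it if your version differs:
pip3 show mooncake-transfer-engine
# NOTE: the import path below is an assumption; check the Mooncake docs for your version
python3 -c "from mooncake.engine import TransferEngine; print('Mooncake transfer engine import OK')"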
Install the latest version of SGLang#
1. Clone SGLang from official repo#
git clone git@github.com:sgl-project/sglang.git
2. Build#
2.1 Build from source#
cd sglang
pip install --upgrade pip
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
If you encounter any problems that you cannot solve, please refer to the SGLang official compilation guide.
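Once the build finishes, a quick import check confirms SGLang is installed; it should print the installed version:
python3 -c "import sglang; print(sglang.__version__)"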
Configuration#
Update (Apr 10, 2025): Good news! The configuration file requirement was removed in PR 5460, so there is no longer any need to prepare the mooncake.json file.
To run the prefill instance and decode instance on different nodes#
Prefill:
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --disaggregation-mode prefill --port 30000 --host 192.168.0.137 --tp-size 2
- The --model-path parameter specifies the model to use.
- The --host parameter specifies the SGLang service host.
- The --port parameter specifies the SGLang service port to listen on.
- The --disaggregation-mode parameter sets the node's role, either 'prefill' or 'decode'.
- The --disaggregation-ib-device parameter sets the device(s) used for data transmission. It is optional, since the devices are detected automatically, but you can still specify them explicitly if needed. If multiple NIC devices are used, separate them with commas and no spaces, e.g. "erdma_0,erdma_1". See the example after this list.
- The --tp-size option is supported. For example, append --tp-size 2 to the command to run SGLang with multiple GPUs.
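For example, to pin the transfer to specific RDMA NICs instead of relying on auto-detection (the device names erdma_0,erdma_1 are illustrative; use the names reported by ibv_devices on your machine):
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --disaggregation-mode prefill --port 30000 --host 192.168.0.137 --disaggregation-ib-device erdma_0,erdma_1 --tp-size 2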
Decode:
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --disaggregation-mode decode --port 30001 --host 192.168.0.140 --tp-size 2
Proxy:
python3 -m sglang.srt.disaggregation.mini_lb --prefill http://192.168.0.137:30000 --decode http://192.168.0.140:30001 --host 0.0.0.0 --port 8000
Test:
curl -X POST http://127.0.0.1:8000/generate -H "Content-Type: application/json" -d '{
"text": "Let me tell you a long story ",
"sampling_params": {
"temperature": 0
}
}'
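To get a longer completion from the decode instance, which is useful for checking that decoding continues correctly after the KVCache handoff, add max_new_tokens, a standard SGLang sampling parameter (the value below is arbitrary):
curl -X POST http://127.0.0.1:8000/generate -H "Content-Type: application/json" -d '{
  "text": "Let me tell you a long story ",
  "sampling_params": {
    "temperature": 0,
    "max_new_tokens": 512
  }
}'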
To run the prefill instance and decode instance on the same node#
Prefill:
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --disaggregation-mode prefill --port 30000 --host 192.168.0.137 --tp-size 2
Decode:
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --disaggregation-mode decode --port 30001 --base-gpu-id 2 --host 192.168.0.137 --tp-size 2
The --base-gpu-id option serves a purpose similar to the CUDA_VISIBLE_DEVICES environment variable: it keeps the decode instance from reusing GPU 0. The difference is that it specifies the starting GPU index directly. If it is set to 2, GPUs 0 and 1 are skipped and the instance starts from GPU 2, so with --tp-size 2 it occupies GPUs 2 and 3. A rough equivalence is sketched below.
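As a sketch of that equivalence (assuming a node with at least 4 GPUs, of which the prefill instance occupies GPUs 0 and 1), hiding the first two GPUs from the process achieves a similar effect to --base-gpu-id 2:
CUDA_VISIBLE_DEVICES=2,3 python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --disaggregation-mode decode --port 30001 --host 192.168.0.137 --tp-size 2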
Proxy:
python3 -m sglang.srt.disaggregation.mini_lb --prefill http://192.168.0.137:30000 --decode http://192.168.0.137:30001 --host 0.0.0.0 --port 8000
Test:
curl -X POST http://127.0.0.1:8000/generate -H "Content-Type: application/json" -d '{
"text": "Let me tell you a long story ",
"sampling_params": {
"temperature": 0
}
}'
Note:
- TP is supported but not required; you can remove --tp-size 2 if you want.
- The --disaggregation-ib-device parameter sets the device(s) used for data transmission. It is optional, since the devices are detected automatically, but you can still specify them explicitly if needed. If multiple NIC devices are used, separate them with commas and no spaces, e.g. "erdma_0,erdma_1".
- XpYd is supported. Running one prefill instance and multiple decode instances on the same node works; however, multiple prefill instances on the same node are not supported because their bootstrap servers would conflict on the same port.
For example, the proxy for a 2P2D deployment spanning two nodes:
python3 -m sglang.srt.disaggregation.mini_lb --prefill http://192.168.0.137:30000,http://192.168.0.140:30000 --decode http://192.168.0.137:30001,http://192.168.0.140:30001 --host 0.0.0.0 --port 8000
- HuggingFace timeouts can be addressed with export SGLANG_USE_MODELSCOPE=true, which makes SGLang download the model from ModelScope instead (see the example below).
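For example, combined with the prefill launch command from above (assuming the same model ID is also available on ModelScope):
export SGLANG_USE_MODELSCOPE=true
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --disaggregation-mode prefill --port 30000 --host 192.168.0.137 --tp-size 2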