# Quick Start: SGLang HiCache with Mooncake Backend
Follow this streamlined workflow to get SGLang HiCache running with Mooncake as the L3 storage backend.
Need more background or tuning options? See the Complete Guide.
## Deployment
Before you begin, make sure that:

- SGLang is installed with HiCache support on the machine hosting your SGLang server. Refer to the official installation guide if needed.
- Mooncake is installed and accessible as the hierarchical cache backend. Detailed build steps live in the Mooncake documentation.
- The router package is installed to provide the `sglang_router` entrypoint:

  ```shell
  pip install sglang-router
  ```
### 1. Launch the Mooncake master service

```shell
mooncake_master --enable_http_metadata_server=true
```
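Before launching the workers, you may want to confirm that the master is accepting connections at the address the workers will use (`127.0.0.1:50051` in the commands below). A minimal sketch using only the standard library; the helper name is ours, not part of Mooncake:

```python
import socket

def master_reachable(host="127.0.0.1", port=50051, timeout=1.0):
    """Return True if a TCP connection to the Mooncake master succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```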
### 2. Launch SGLang with Mooncake L3 storage

```shell
MOONCAKE_MASTER=127.0.0.1:50051 python -m sglang.launch_server \
  --model-path [model_path] \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-storage-prefetch-policy timeout \
  --hicache-storage-backend mooncake
```
**Key flag:** `--hicache-storage-prefetch-policy {best_effort,wait_complete,timeout}` determines when prefetching from storage should stop. `timeout` usually offers the best balance when Mooncake is the backend.
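If you generate launch commands from scripts, the flags above can be assembled programmatically; a hypothetical helper (the function and constant names are ours) that validates the policy choice before building the argument list:

```python
# The three policies accepted by --hicache-storage-prefetch-policy,
# as listed in the flag description above.
PREFETCH_POLICIES = {"best_effort", "wait_complete", "timeout"}

def hicache_args(model_path, policy="timeout", page_size=64):
    """Build the sglang.launch_server argument list used in this guide."""
    if policy not in PREFETCH_POLICIES:
        raise ValueError(f"unknown prefetch policy: {policy}")
    return [
        "--model-path", model_path,
        "--page-size", str(page_size),
        "--enable-hierarchical-cache",
        "--hicache-storage-prefetch-policy", policy,
        "--hicache-storage-backend", "mooncake",
    ]
```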
## Prefill/Decode Disaggregation

The disaggregated setup runs three processes: a prefill worker, a decode worker, and a router. Launch each command below in its own terminal window.
### Prefill worker

```shell
MOONCAKE_MASTER=127.0.0.1:50051 python -m sglang.launch_server \
  --model-path [model_path] \
  --page-size 64 \
  --enable-hierarchical-cache \
  --hicache-storage-prefetch-policy timeout \
  --hicache-storage-backend mooncake \
  --disaggregation-mode prefill \
  --disaggregation-ib-device "mlx5_1" \
  --base-gpu-id 0 \
  --port 30000
```
### Decode worker

```shell
python -m sglang.launch_server \
  --model-path [model_path] \
  --page-size 64 \
  --disaggregation-mode decode \
  --disaggregation-ib-device "mlx5_1" \
  --base-gpu-id 1 \
  --port 30001
```
### Router

```shell
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill "http://127.0.0.1:30000" \
  --decode "http://127.0.0.1:30001" \
  --host 0.0.0.0 \
  --port 8000
```
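If you later script the three launches instead of opening separate terminals, all three command lines can come from one place. A sketch that reuses the exact ports and flags shown above (the helper itself is hypothetical, not part of SGLang):

```python
def pd_commands(model_path):
    """Return (prefill, decode, router) argv lists matching this guide."""
    common = ["python", "-m", "sglang.launch_server",
              "--model-path", model_path, "--page-size", "64"]
    prefill = common + [
        "--enable-hierarchical-cache",
        "--hicache-storage-prefetch-policy", "timeout",
        "--hicache-storage-backend", "mooncake",
        "--disaggregation-mode", "prefill",
        "--base-gpu-id", "0", "--port", "30000",
    ]
    decode = common + [
        "--disaggregation-mode", "decode",
        "--base-gpu-id", "1", "--port", "30001",
    ]
    router = ["python", "-m", "sglang_router.launch_router",
              "--pd-disaggregation",
              "--prefill", "http://127.0.0.1:30000",
              "--decode", "http://127.0.0.1:30001",
              "--host", "0.0.0.0", "--port", "8000"]
    return prefill, decode, router
```

Note that the prefill worker still needs `MOONCAKE_MASTER` in its environment, as in the terminal commands above.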
### Smoke test

```shell
curl -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Let me tell you a long story ",
    "sampling_params": {
      "temperature": 0
    }
  }'
```
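The same request can be issued from Python with only the standard library; the payload mirrors the curl call above (the function names are ours):

```python
import json
import urllib.request

def build_payload(text, temperature=0):
    # Same body as the curl example: prompt text plus sampling parameters.
    return {"text": text, "sampling_params": {"temperature": temperature}}

def generate(text, url="http://127.0.0.1:8000/generate", timeout=30):
    # POST the JSON payload to the router and return the decoded response.
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())

# Example (requires the router from the previous step to be running):
# generate("Let me tell you a long story ")
```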
## Quick tips

- `--disaggregation-ib-device` is optional: SGLang autodetects devices, but you can set it explicitly (comma-separated, no spaces) when multiple NICs are available.
- Use `--tp-size` to enable tensor-parallel execution across GPUs if required.
- Optional flags to experiment with once you need them (some still have compatibility gaps):
  - `--disaggregation-decode-enable-offload-kvcache` writes the decode worker's outputs back into Mooncake; enable it when you want decoded KV to persist in L3.
  - Launch dedicated Mooncake store service nodes when you want to scale L3 capacity beyond what the SGLang servers contribute.
- HuggingFace timeouts can be mitigated with `export SGLANG_USE_MODELSCOPE=true`.