Quick Start: SGLang HiCache with Mooncake Backend

Follow this streamlined workflow to get SGLang HiCache running with Mooncake as the L3 storage backend.

Need more background or tuning options? See the Complete Guide.

Deployment

Before you begin, make sure that:

- SGLang is installed with HiCache support on the machine hosting your SGLang server. Refer to the official installation guide if needed.
- Mooncake is installed and accessible as the hierarchical cache backend. Detailed build steps live in the Mooncake documentation.
- The router package is installed to provide the sglang_router entrypoint:

pip install sglang-router

1. Launch the Mooncake master service

mooncake_master --enable_http_metadata_server=true
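
Before launching SGLang, it can help to confirm that the master is accepting connections. The following is a minimal sketch, assuming the master listens on 127.0.0.1:50051 (the address passed through MOONCAKE_MASTER in the next step). The helper name wait_for_port is illustrative and is not part of Mooncake or SGLang; it only checks that a TCP connection can be opened, not that the service is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only shows that something is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example (assumed address, taken from the MOONCAKE_MASTER value in this guide):
#   wait_for_port("127.0.0.1", 50051)
```

If the call returns False, check that mooncake_master is still running and that the port matches your deployment.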

2. Launch SGLang with Mooncake L3 storage

MOONCAKE_MASTER=127.0.0.1:50051 python -m sglang.launch_server \
    --model-path [model_path] \
    --page-size 64 \
    --enable-hierarchical-cache \
    --hicache-storage-prefetch-policy timeout \
    --hicache-storage-backend mooncake

Key flag: --hicache-storage-prefetch-policy {best_effort,wait_complete,timeout} determines when prefetching from storage should stop. timeout usually offers the best balance when Mooncake is the backend.
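
When the launch command is generated by a script, validating the policy value up front avoids a failed server start. This is a rough sketch that uses only the flags shown above; build_launch_command and PREFETCH_POLICIES are hypothetical names introduced for illustration, not SGLang APIs.

```python
import shlex

# The three values listed for --hicache-storage-prefetch-policy in this guide.
PREFETCH_POLICIES = ("best_effort", "wait_complete", "timeout")


def build_launch_command(model_path: str, prefetch_policy: str = "timeout",
                         page_size: int = 64) -> list[str]:
    """Build the argv list for the single-server HiCache + Mooncake launch."""
    if prefetch_policy not in PREFETCH_POLICIES:
        raise ValueError(
            f"unknown prefetch policy {prefetch_policy!r}; "
            f"expected one of {PREFETCH_POLICIES}"
        )
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--page-size", str(page_size),
        "--enable-hierarchical-cache",
        "--hicache-storage-prefetch-policy", prefetch_policy,
        "--hicache-storage-backend", "mooncake",
    ]


# Print a copy-pasteable command line; the model path is a placeholder.
print(shlex.join(build_launch_command("/path/to/model")))
```

To actually start the server from Python, the list can be passed to subprocess.run together with an environment that sets MOONCAKE_MASTER, as in the shell command above.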

Prefill/Decode Disaggregation

The disaggregated setup runs three processes—prefill worker, decode worker, and router. Launch each command below in its own terminal window.

Prefill worker

MOONCAKE_MASTER=127.0.0.1:50051 python -m sglang.launch_server \
    --model-path [model_path] \
    --page-size 64 \
    --enable-hierarchical-cache \
    --hicache-storage-prefetch-policy timeout \
    --hicache-storage-backend mooncake \
    --disaggregation-mode prefill \
    --disaggregation-ib-device "mlx5_1" \
    --base-gpu-id 0 \
    --port 30000

Decode worker

python -m sglang.launch_server \
    --model-path [model_path] \
    --page-size 64 \
    --disaggregation-mode decode \
    --disaggregation-ib-device "mlx5_1" \
    --base-gpu-id 1 \
    --port 30001

Router

python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill "http://127.0.0.1:30000" \
    --decode "http://127.0.0.1:30001" \
    --host 0.0.0.0 \
    --port 8000
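
If you prefer to start the three processes from one script rather than three terminals, the commands above can be assembled programmatically. This is only a sketch under the assumptions of this guide (same ports, GPU IDs, and IB device); pd_commands and launch_all are hypothetical helper names, and launch_all starts everything at once without waiting for the workers to become ready, which may not suit every deployment.

```python
import os
import subprocess


def pd_commands(model_path: str, ib_device: str = "mlx5_1") -> dict[str, list[str]]:
    """Return the argv lists for the prefill worker, decode worker, and router."""
    server = ["python", "-m", "sglang.launch_server",
              "--model-path", model_path, "--page-size", "64"]
    prefill = server + [
        "--enable-hierarchical-cache",
        "--hicache-storage-prefetch-policy", "timeout",
        "--hicache-storage-backend", "mooncake",
        "--disaggregation-mode", "prefill",
        "--disaggregation-ib-device", ib_device,
        "--base-gpu-id", "0",
        "--port", "30000",
    ]
    decode = server + [
        "--disaggregation-mode", "decode",
        "--disaggregation-ib-device", ib_device,
        "--base-gpu-id", "1",
        "--port", "30001",
    ]
    router = [
        "python", "-m", "sglang_router.launch_router",
        "--pd-disaggregation",
        "--prefill", "http://127.0.0.1:30000",
        "--decode", "http://127.0.0.1:30001",
        "--host", "0.0.0.0",
        "--port", "8000",
    ]
    return {"prefill": prefill, "decode": decode, "router": router}


def launch_all(model_path: str, master: str = "127.0.0.1:50051") -> dict[str, subprocess.Popen]:
    """Start all three processes; only the prefill worker gets MOONCAKE_MASTER."""
    prefill_env = {**os.environ, "MOONCAKE_MASTER": master}
    procs = {}
    for name, cmd in pd_commands(model_path).items():
        env = prefill_env if name == "prefill" else None  # None inherits the parent env
        procs[name] = subprocess.Popen(cmd, env=env)
    return procs
```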

Smoke test

curl -X POST http://127.0.0.1:8000/generate \
    -H "Content-Type: application/json" \
    -d '{
  "text": "Let me tell you a long story ",
  "sampling_params": {
    "temperature": 0
  }
}'
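
The same request can be sent from Python using only the standard library, which is convenient inside a test script. This sketch mirrors the curl call above and assumes the router is reachable at http://127.0.0.1:8000. The function names are illustrative, and the code makes no assumption about the fields in the response beyond it being JSON.

```python
import json
import urllib.request
from typing import Any


def build_generate_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build the POST /generate request with the same body as the curl example."""
    payload = {"text": prompt, "sampling_params": {"temperature": 0}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str = "http://127.0.0.1:8000", timeout_s: float = 60.0) -> Any:
    """Send the request and return the parsed JSON response."""
    req = build_generate_request(base_url, "Let me tell you a long story ")
    # urlopen raises urllib.error.URLError or HTTPError if the router is unreachable
    # or returns an error status.
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling smoke_test() once all three processes are up should return the router's JSON reply; an exception usually means one of the processes is not yet listening.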

Quick tips

- --disaggregation-ib-device is optional: SGLang autodetects devices, but you can set it explicitly (comma-separated, no spaces) when multiple NICs are available.
- Use --tp-size to enable tensor-parallel execution across GPUs if required.
- Optional flags to experiment with once you need them (some still have compatibility gaps):
  - --disaggregation-decode-enable-offload-kvcache writes the decode worker’s outputs back into Mooncake; enable it when you want decoded KV to persist in L3.
- Launch dedicated Mooncake store service nodes when you want to scale L3 capacity beyond what the SGLang servers contribute.
- HuggingFace timeouts can be mitigated with export SGLANG_USE_MODELSCOPE=true.
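
For the --disaggregation-ib-device tip, a small helper can produce the comma-separated, space-free value from a list of device names. This is an illustrative sketch: format_ib_devices is a hypothetical name, and mlx5_0 in the usage comment is an example device that does not appear elsewhere in this guide.

```python
def format_ib_devices(devices: list[str]) -> str:
    """Join device names into the comma-separated, no-spaces form the flag expects."""
    cleaned = [d.strip() for d in devices if d.strip()]
    if not cleaned:
        raise ValueError("at least one device name is required")
    for name in cleaned:
        # Reject names that would break the comma-separated format.
        if "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid device name: {name!r}")
    return ",".join(cleaned)


# Example: format_ib_devices(["mlx5_0", "mlx5_1"]) gives "mlx5_0,mlx5_1",
# which can be passed as the value of --disaggregation-ib-device.
```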