Quick Start: SGLang HiCache with Mooncake Backend#
Follow this streamlined workflow to get SGLang HiCache running with Mooncake as the L3 storage backend.
Need more background or tuning options? See the Complete Guide.
Deployment#
Before you begin, make sure that:
SGLang is installed with HiCache support on the machine hosting your SGLang server. Refer to the official installation guide if needed.
Mooncake is installed and accessible as the hierarchical cache backend. Detailed build steps live in the Mooncake documentation.
If you want HugeTLB-backed host KV memory, reserve the hugepages on the host before launching SGLang. The helper
python3 scripts/check_hicache_hugepage_requirements.py ...from a source checkout, ormooncake-hicache-sizing ...inside the source-built Docker image, can size the required budget from your--hicache-size,MOONCAKE_GLOBAL_SEGMENT_SIZE, andMC_MMAP_ARENA_POOL_SIZEvalues.The router package is installed to provide the
sglang_routerentrypoint:
pip install sglang-router
1. Size and reserve HugeTLB on the host#
python3 scripts/check_hicache_hugepage_requirements.py \
--tp-size 4 \
--hicache-size 64gb \
--global-segment-size 8gb \
--arena-pool-size 56gb \
--available-hugetlb 512gb
sudo sysctl -w vm.nr_hugepages=262144
grep -E 'HugePages_Total|HugePages_Free|Hugepagesize' /proc/meminfo
262144 is an example for 512 GiB of 2 MiB pages. The 64gb / 56gb sizing inputs above are tuned examples for large multi-GPU runs, not defaults. Adjust them, and the resulting vm.nr_hugepages, to match your planned HiCache budget. On smaller hosts, start with an arena pool of 8gb or 16gb.
2. Launch the Mooncake master service#
mooncake_master --enable_http_metadata_server=true
3. Launch SGLang with Mooncake L3 storage#
export MC_STORE_USE_HUGEPAGE="1"
export MC_STORE_HUGEPAGE_SIZE="2MB"
export MOONCAKE_GLOBAL_SEGMENT_SIZE="8gb"
export MC_MMAP_ARENA_POOL_SIZE="56gb"
MOONCAKE_MASTER=127.0.0.1:50051 python -m sglang.launch_server \
--model-path [model_path] \
--page-size 64 \
--enable-hierarchical-cache \
--hicache-storage-prefetch-policy timeout \
--hicache-storage-backend mooncake
Key flag: --hicache-storage-prefetch-policy {best_effort,wait_complete,timeout} determines when prefetching from storage should stop. timeout usually offers the best balance when Mooncake is the backend.
Memory allocator: Mooncake’s mmap arena is opt-in. The 56gb example above is a benchmark-scale tuning value, not the default. Setting MC_MMAP_ARENA_POOL_SIZE enables it and chooses the pool size; if you enable it via gflag instead, the default pool size is 8gb. To force the baseline direct-mmap() path, set MC_DISABLE_MMAP_ARENA=1 (also accepts true, yes, or on) before the first Mooncake mmap-buffer allocation in the process. See the Complete Guide for the full HugeTLB sizing and troubleshooting flow.
Prefill/Decode Disaggregation#
The disaggregated setup runs three processes—prefill worker, decode worker, and router. Launch each command below in its own terminal window.
Prefill worker#
MOONCAKE_MASTER=127.0.0.1:50051 python -m sglang.launch_server \
--model-path [model_path] \
--page-size 64 \
--enable-hierarchical-cache \
--hicache-storage-prefetch-policy timeout \
--hicache-storage-backend mooncake \
--disaggregation-mode prefill \
--disaggregation-ib-device "mlx5_1" \
--base-gpu-id 0 \
--port 30000
Decode worker#
python -m sglang.launch_server \
--model-path [model_path] \
--page-size 64 \
--disaggregation-mode decode \
--disaggregation-ib-device "mlx5_1" \
--base-gpu-id 1 \
--port 30001
Router#
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill "http://127.0.0.1:30000" \
--decode "http://127.0.0.1:30001" \
--host 0.0.0.0 \
--port 8000
Smoke test#
curl -X POST http://127.0.0.1:8000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Let me tell you a long story ",
"sampling_params": {
"temperature": 0
}
}'
Quick tips#
--disaggregation-ib-deviceis optional—SGLang autodetects devices, but you can set it explicitly (comma-separated, no spaces) when multiple NICs are available.Use
--tp-sizeto enable tensor-parallel execution across GPUs if required.For Docker launches, use
--ipc=host,--ulimit memlock=-1, and a large--shm-size, then pass the sameMC_STORE_USE_HUGEPAGE,MC_STORE_HUGEPAGE_SIZE,MOONCAKE_GLOBAL_SEGMENT_SIZE, andMC_MMAP_ARENA_POOL_SIZEvalues throughdocker run -e ....Optional flags to experiment with once you need them (some still have compatibility gaps):
--disaggregation-decode-enable-offload-kvcachewrites the decode worker’s outputs back into Mooncake; enable it when you want decoded KV to persist in L3.
Launch dedicated Mooncake
store servicenodes when you want to scale L3 capacity beyond what the SGLang servers contribute.HuggingFace timeouts can be mitigated with
export SGLANG_USE_MODELSCOPE=true.