Mooncake Store Deployment & Tuning Guide#
This guide covers minimal deployment and operational tuning of Mooncake Store.
Architecture Overview#

Master Service (mooncake_master): The central coordinator. It manages cluster membership, allocates object storage across client nodes, and enforces eviction/placement policies. Runs as a standalone process.
Client Node: Each node contributes DRAM (and optionally VRAM/SSD) to form the distributed cache pool. Clients communicate with the master over RPC for control operations (Put/Get/Remove), but transfer actual data directly between each other via the Transfer Engine — the master is never in the data path.
Metadata Service: A separate service (etcd, Redis, or HTTP) used by the Transfer Engine for peer discovery and configuration. The master’s embedded HTTP metadata server can replace an external etcd/Redis for simple deployments. We also provide a P2P handshake mechanism (P2PHANDSHAKE) that enables decentralized metadata management by storing metadata locally on each node, eliminating the need for a centralized service — this is the simplest metadata handshake method and the recommended starting point (see Quick Start).
For a detailed design discussion, see the Mooncake Store Design.
Quick Start#
Deploy a minimal single-node Mooncake Store in three steps.
1. Start the Metadata Service#
Choose one option:
# Option A: P2P handshake — the simplest metadata handshake method (recommended)
# Nothing to start here. There is no metadata service to deploy: each node
# exchanges and stores metadata locally during connection setup. Just set the
# client's metadata_server to the literal string "P2PHANDSHAKE" (see step 3).
# Option B: Embed HTTP metadata server in the master
# (configured in step 2 via --enable_http_metadata_server)
# Option C: External etcd
etcd --listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://localhost:2379
Tip: P2P handshake is the easiest way to get started — it is decentralized and requires no etcd/Redis/HTTP metadata service. Prefer it for development and simple deployments; use an external etcd/Redis for large, long-lived clusters.
2. Start the Master Service#
mooncake_master \
--enable_http_metadata_server=true \
--http_metadata_server_host=0.0.0.0 \
--http_metadata_server_port=8080
On success the log shows:
Starting Mooncake Master Service
Port: 50051
Max threads: 4
Master service listening on 0.0.0.0:50051
The master’s default RPC port is 50051. The HTTP metadata server serves on port 8080 in this example.
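Before starting clients, it can help to confirm that both listeners from this step are reachable. The following is a minimal sketch using only the Python standard library. `wait_for_port` is a hypothetical helper, not part of Mooncake, and the ports are the ones used in this example (50051 for RPC, 8080 for the embedded HTTP metadata server).

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 10.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for bring-up scripts; not a Mooncake API.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not listening yet (or unreachable); retry shortly.
            time.sleep(0.2)
    return False
```

For example, `wait_for_port("127.0.0.1", 50051)` and `wait_for_port("127.0.0.1", 8080)` should both return `True` once the master from this step is up. A successful TCP connect only shows that the port is open; it does not prove that the service behind it is healthy.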
3. Start a Store Client#
Use the Python sample program to bring up a client that contributes 3200 MiB (about 3.1 GiB) of DRAM to the cluster:
# stress_cluster_benchmark.py
import os
from distributed_object_store import DistributedObjectStore
store = DistributedObjectStore()
store.setup(
    local_hostname=os.getenv("LOCAL_HOSTNAME", "localhost"),
    metadata_server=os.getenv("METADATA_ADDR", "http://127.0.0.1:8080/metadata"),
    global_segment_size=3200 * 1024 * 1024,  # DRAM contributed to the cluster
    local_buffer_size=512 * 1024 * 1024,  # Transfer Engine buffer
    protocol=os.getenv("PROTOCOL", "tcp"),
    device_name=os.getenv("DEVICE_NAME", ""),
    master_server_address=os.getenv("MASTER_SERVER", "127.0.0.1:50051"),
)
python3 stress_cluster_benchmark.py
To use P2P handshake instead of an HTTP/etcd metadata service, pass the
literal string P2PHANDSHAKE as metadata_server — no other change is needed:
store.setup(
    local_hostname=os.getenv("LOCAL_HOSTNAME", "localhost"),
    metadata_server="P2PHANDSHAKE",  # decentralized, no metadata service
    global_segment_size=3200 * 1024 * 1024,
    local_buffer_size=512 * 1024 * 1024,
    protocol=os.getenv("PROTOCOL", "tcp"),
    device_name=os.getenv("DEVICE_NAME", ""),
    master_server_address=os.getenv("MASTER_SERVER", "127.0.0.1:50051"),
)
The standalone store service accepts the same value:
python -m mooncake.mooncake_store_service \
--local_hostname=localhost \
--metadata_server=P2PHANDSHAKE \
--master_server=127.0.0.1:50051
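The two `setup()` calls above differ only in `metadata_server`. One way to keep that choice in a single place is to build the keyword arguments separately, as in the sketch below. `build_setup_kwargs` and its `use_p2p` parameter are hypothetical names introduced for illustration; the keys, environment variables, and defaults are copied from the sample above.

```python
import os

MIB = 1024 * 1024


def build_setup_kwargs(env=None, use_p2p=False):
    """Build keyword arguments for store.setup() (hypothetical helper).

    env defaults to os.environ; pass a plain dict to make the result
    reproducible in tests.
    """
    env = os.environ if env is None else env
    if use_p2p:
        # Decentralized handshake: no metadata service to point at.
        metadata_server = "P2PHANDSHAKE"
    else:
        metadata_server = env.get("METADATA_ADDR", "http://127.0.0.1:8080/metadata")
    return {
        "local_hostname": env.get("LOCAL_HOSTNAME", "localhost"),
        "metadata_server": metadata_server,
        "global_segment_size": 3200 * MIB,  # DRAM contributed to the cluster
        "local_buffer_size": 512 * MIB,     # Transfer Engine buffer
        "protocol": env.get("PROTOCOL", "tcp"),
        "device_name": env.get("DEVICE_NAME", ""),
        "master_server_address": env.get("MASTER_SERVER", "127.0.0.1:50051"),
    }
```

With a `store` created as in the sample, `store.setup(**build_setup_kwargs(use_p2p=True))` would then be equivalent to the P2P handshake call shown above.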
What just happened:
The client registered itself with the master via RPC.
The master allocated a 3200 MiB (about 3.1 GiB) segment on this node and added it to the cluster’s memory pool.
The client is now ready to serve Put/Get/Remove requests.
You can also deploy a standalone store service process that hosts memory/SSD without an inference application:
python -m mooncake.mooncake_store_service \
--local_hostname=localhost \
--metadata_server=http://127.0.0.1:8080/metadata \
--master_server=127.0.0.1:50051
Run the Stress Benchmark#
Mooncake Store includes sample programs for validating C++ and Python integrations. The stress benchmark script can be used to verify a two-role prefill/decode setup.
Configure the script for your network:
local_hostname: the local machine’s reachable IP address or hostname.
metadata_server: the Transfer Engine metadata service, for example P2PHANDSHAKE, http://127.0.0.1:8080/metadata, or an etcd address.
master_server_address: the Mooncake Store master address. Use IP:Port in default mode, or etcd://IP:Port;IP:Port;...;IP:Port in etcd-backed HA mode.
Then start the roles:
ROLE=prefill python3 mooncake-store/tests/stress_cluster_benchmark.py
ROLE=decode python3 mooncake-store/tests/stress_cluster_benchmark.py
For RDMA, topology auto-discovery and NIC filters can be passed through environment variables:
ROLE=prefill MC_MS_AUTO_DISC=1 MC_MS_FILTERS="mlx5_1,mlx5_2" python3 mooncake-store/tests/stress_cluster_benchmark.py
ROLE=decode MC_MS_AUTO_DISC=1 MC_MS_FILTERS="mlx5_1,mlx5_2" python3 mooncake-store/tests/stress_cluster_benchmark.py
The absence of errors indicates successful data transfer.
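A malformed `master_server_address` is an easy mistake when switching between the two forms described above. The sketch below shows one way to validate the value before launching the benchmark. `parse_master_address` is a hypothetical helper, not a Mooncake function, and it only checks the `IP:Port` and `etcd://IP:Port;IP:Port` shapes named in this guide.

```python
def parse_master_address(address: str):
    """Classify a master_server_address string (hypothetical helper).

    Returns (mode, endpoints) where mode is "default" or "etcd-ha" and
    endpoints is a list of (host, port) tuples.
    """
    prefix = "etcd://"
    if address.startswith(prefix):
        mode = "etcd-ha"
        raw_endpoints = [e for e in address[len(prefix):].split(";") if e]
    else:
        mode = "default"
        raw_endpoints = [address]

    endpoints = []
    for raw in raw_endpoints:
        # Split on the last colon so the port is always the final field.
        host, sep, port = raw.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected IP:Port, got {raw!r}")
        endpoints.append((host, int(port)))
    if not endpoints:
        raise ValueError("no endpoints given")
    return mode, endpoints
```

Running a check like this up front turns a typo into an immediate `ValueError` instead of a connection failure later in the benchmark.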
Standalone Real Client via RPC#
To run a resource-owning real client as a standalone RPC process:
mooncake_client \
--global_segment_size="4GB" \
--master_server_address="localhost:50051" \
--metadata_server="http://localhost:8080/metadata"
The real client connects to the master and listens on port 50052 by default. Application processes such as vLLM or SGLang can then use dummy clients to forward requests to this real client.
Common mooncake_client flags:
| Flag | Default | Description |
|---|---|---|
|  |  | Client service bind host |
|  |  | Client service listen port |
|  |  | Global segment size contributed by the client |
|  |  | Master service address |
|  |  | Transfer Engine metadata service |
|  |  | Transfer protocol |
|  | empty | Transfer device name |
|  |  | Client worker thread count |
Standalone Real Client via HTTP#
Use python -m mooncake.mooncake_store_service to start a real client with a lightweight HTTP API for manual Get and Put debugging.
Create a JSON config:
{
"local_hostname": "localhost",
"metadata_server": "http://localhost:8080/metadata",
"global_segment_size": 268435456,
"local_buffer_size": 268435456,
"protocol": "tcp",
"device_name": "",
"master_server_address": "localhost:50051"
}
Start the service:
python -m mooncake.mooncake_store_service --config=<config_path> --port=8081
The main startup parameters are --config, the path to the JSON configuration file, and --port, the HTTP server port.
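The byte counts in the JSON config are easier to review when they are computed rather than typed by hand. The sketch below generates the same config as the example above, where 268435456 bytes is 256 MiB. `write_config` is a hypothetical helper and the output path is whatever you choose to pass as `--config`.

```python
import json
from pathlib import Path

MIB = 1024 * 1024

# Same keys and values as the JSON example in this section.
config = {
    "local_hostname": "localhost",
    "metadata_server": "http://localhost:8080/metadata",
    "global_segment_size": 256 * MIB,
    "local_buffer_size": 256 * MIB,
    "protocol": "tcp",
    "device_name": "",
    "master_server_address": "localhost:50051",
}


def write_config(path):
    """Write the config as indented JSON and return the resulting Path."""
    target = Path(path)
    target.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return target
```

The returned path can be passed directly as the `--config` argument when starting the service.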
Verify Installed Examples#
For a Python integration check, run mooncake-store/tests/distributed_object_store_provider.py after starting the metadata service and mooncake_master.
For a C++ integration check, run build/mooncake-store/tests/client_integration_test after building tests and starting the required services.
Verify#
# Health check — master metrics endpoint
curl -s http://localhost:9003/metrics/summary
# List registered clients
# (exposed through the store's Python API or RPC)
Deployment Scenarios#
Single-Node (TCP) — Development / Quick Evaluation#
The simplest deployment, as shown in Quick Start. A single mooncake_master orchestrates clients over TCP. Suitable for development, testing, and single-host evaluation.
mooncake_master \
--enable_http_metadata_server=true \
--http_metadata_server_host=0.0.0.0 \
--http_metadata_server_port=8080
Limitation: the master is a single point of failure. If it crashes, cluster operations pause until it is restored.
High-Availability (etcd) — Production HA#
Runs a cluster of master instances coordinated through etcd. If the leader fails, the remaining instances elect a new leader automatically.
# Start each master instance with:
mooncake_master \
--enable-ha=true \
--etcd-endpoints="10.0.0.1:2379;10.0.0.2:2379;10.0.0.3:2379" \
--rpc-address=10.0.0.1
Each instance must specify its own reachable --rpc-address. The etcd cluster used for HA can be shared with or separate from the Transfer Engine’s metadata etcd.
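Because every instance shares the same etcd endpoint list but needs its own `--rpc-address`, the per-host command lines are natural to generate. The following sketch does that with the flags shown above. `ha_master_commands` is a hypothetical helper, and the second and third master addresses are assumed values for illustration.

```python
import shlex


def ha_master_commands(master_ips, etcd_endpoints):
    """Return one mooncake_master argv list per master instance.

    The etcd endpoints are joined with semicolons, matching the
    --etcd-endpoints format used in this section.
    """
    endpoints = ";".join(etcd_endpoints)
    return [
        [
            "mooncake_master",
            "--enable-ha=true",
            f"--etcd-endpoints={endpoints}",
            f"--rpc-address={ip}",  # must be reachable and unique per instance
        ]
        for ip in master_ips
    ]


# Assumed addresses: three masters co-located with three etcd members.
for argv in ha_master_commands(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    ["10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"],
):
    # shlex.join quotes the semicolons so the line is safe to paste into a shell.
    print(shlex.join(argv))
```

Quoting matters here: an unquoted semicolon would be treated by the shell as a command separator.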
High-Availability (Redis) — Alternative HA Backend#
Same HA semantics but using Redis instead of etcd for leader election:
mooncake_master \
--enable-ha=true \
--ha_backend_type=redis \
--ha_backend_connstring="redis://127.0.0.1:6379" \
--rpc-address=10.0.0.1
Snapshot & Restore — Backup / Disaster Recovery#
Caution
Metadata Snapshot And Restore is an experimental feature.
Periodically persist master metadata to local disk or S3, enabling recovery from a recent snapshot after a crash.
export MOONCAKE_SNAPSHOT_LOCAL_PATH=/data/mooncake_snapshots
mooncake_master \
--enable_snapshot=true \
--snapshot_interval_seconds=300 \
--snapshot_retention_count=5 \
--snapshot_object_store_type=local \
--enable_snapshot_restore=true
Tiered Storage with SSD Offload — Cost-Effective Capacity#
Extends the cache pool from DRAM to SSD while keeping normal reads and writes on the distributed memory path. With --enable_offload=true, completed memory writes are queued for asynchronous SSD persistence through the master control plane. Set --offload_on_evict=true to defer that SSD write until the memory eviction path selects an object for reclamation. When --promotion_on_hit=true, SSD-only objects can be promoted back to DRAM after repeated reads; admission is gated by --promotion_admission_threshold.
mooncake_master \
--enable_offload=true \
--offload_on_evict=true \
--promotion_on_hit=true \
--promotion_admission_threshold=2 \
--root_fs_dir=/mnt/ssd_cache \
--enable_http_metadata_server=true \
--http_metadata_server_port=8080
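To build intuition for what a promotion admission threshold does, the sketch below models read counting with a small Count-Min Sketch and admits a key once its estimated count reaches the threshold. This is an illustrative toy under stated assumptions, not Mooncake's implementation: the class, the hash choice, the sketch dimensions, and the "count at or above threshold" rule are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Tiny Count-Min Sketch: approximate per-key counters in fixed memory."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key: str):
        # One independent hash per row, derived by salting BLAKE2b.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key: str) -> None:
        for row, col in self._cells(key):
            self.rows[row][col] += 1

    def estimate(self, key: str) -> int:
        # Collisions can only inflate counters, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._cells(key))


def should_promote(sketch: CountMinSketch, key: str, threshold: int) -> bool:
    """Record one read of key; return True once its estimate reaches threshold."""
    sketch.add(key)
    return sketch.estimate(key) >= threshold
```

Under this model, a threshold of 2 means the first read of an SSD-only key is served without promotion and the second read qualifies it. A higher threshold filters out one-off reads at the cost of slower warm-up.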
CXL-Aware Allocation — Memory Tiering#
When the host has CXL-attached memory, the master can preferentially allocate new objects on the CXL tier, reserving local DRAM for latency-sensitive operations.
mooncake_master \
--enable_cxl=true \
--cxl_path=/dev/dax0.0 \
--cxl_size=17179869184 \
--allocation_strategy=cxl
Container / Dynamic Network Interface#
When the master runs in a container with a dynamic IP, use --rpc_interface to resolve the RPC address from a stable interface name:
mooncake_master \
--rpc_interface=eth0 \
--enable_http_metadata_server=true \
--http_metadata_server_host=0.0.0.0 \
--http_metadata_server_port=8080
The master resolves the current IPv4 address of eth0 at startup and uses it as the advertised RPC address.
Metrics Endpoints#
The master exposes Prometheus-style metrics on --metrics_port:
# Prometheus format
curl -s http://<master_host>:9003/metrics
# Human-readable summary
curl -s http://<master_host>:9003/metrics/summary
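During bring-up it can be convenient to read the Prometheus endpoint from a script rather than with curl. The sketch below fetches `/metrics` and parses simple samples into a dictionary. Both function names are hypothetical, the parser assumes each sample line has the form `name value` with no trailing timestamp, and the metric names used in any test data are invented for illustration.

```python
import urllib.request


def fetch_metrics_text(host: str, port: int = 9003, timeout_s: float = 5.0) -> str:
    """Fetch the raw Prometheus text from the master's metrics port."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read().decode("utf-8")


def parse_prometheus_text(text: str) -> dict:
    """Parse 'name value' sample lines; skip comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no samples
        name, sep, value = line.rpartition(" ")
        if not sep:
            continue  # not a 'name value' pair; ignore rather than guess
        samples[name] = float(value)
    return samples
```

For production monitoring, a Prometheus scrape job is the better fit; a parser like this is mainly useful for quick assertions in smoke tests.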
Quick Tips#
Scale
--rpc_thread_numwith available CPU cores and workload.Start with default eviction settings; adjust
--eviction_high_watermark_ratioand--eviction_ratiobased on memory pressure and object churn.Use
/metrics/summaryduring bring-up; integrate/metricswith Prometheus/Grafana for production.For detailed SSD offload configuration (storage backends, eviction policies, io_uring), see the SSD Offload guide.
For NVMe-oF SSD pool configuration see the NVMe-oF SSD Pool Deployment Guide
For experimental 3FS (USRBIO) integration as a persistent storage backend, see the 3FS USRBIO Plugin guide.
For detailed monitoring and observation see Observability
Reference: Master Startup Flags#
RPC#
| Flag | Default | Description |
|---|---|---|
|  |  | RPC listen port |
|  |  | RPC worker threads |
|  |  | RPC bind address |
|  | empty | Network interface to resolve RPC address at startup (overrides |
|  |  | Idle connection timeout; |
|  |  | Enable TCP_NODELAY |
Logging#
The master uses glog. When --log_dir is set, all severities are merged into a single journal file in that directory (mooncake_master.INFO.<date>-<time>.<pid>), reachable through the stable mooncake_master.INFO symlink.
glog’s standard flags (--log_dir, --max_log_size, --logtostderr, …) control the rest.
Metrics#
| Flag | Default | Description |
|---|---|---|
|  |  | Periodically log master metrics |
|  |  | HTTP port for |
HTTP Metadata Server (Embedded)#
| Flag | Default | Description |
|---|---|---|
|  |  | Enable embedded HTTP metadata server |
|  |  | Metadata bind host |
|  |  | Metadata TCP port |
Memory Allocator#
| Flag | Default | Description |
|---|---|---|
|  |  | Memory allocator: |
Allocation Strategy#
| Flag | Default | Description |
|---|---|---|
|  |  |  |
PutStart Timeouts#
| Flag | Default | Description |
|---|---|---|
|  |  | Seconds before an uncompleted |
|  |  | Seconds before |
Eviction & TTLs#
| Flag | Default | Description |
|---|---|---|
|  |  | Lease TTL for KV objects. Supports |
|  |  | Soft pin TTL (30 min) |
|  |  | Allow evicting soft-pinned objects |
|  |  | Fraction evicted at high watermark |
|  |  | Usage ratio triggering eviction |
|  |  | Seconds before a silent client is considered disconnected |
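The interaction between the usage ratio that triggers eviction and the fraction evicted can be reasoned about with a small model. The sketch below is an assumption-laden illustration, not the master's algorithm: it treats the eviction fraction as a share of the current object count, and the numeric values in the usage note are examples rather than documented defaults.

```python
def objects_to_evict(num_objects, used_bytes, capacity_bytes,
                     high_watermark_ratio, eviction_ratio):
    """Return how many objects a simple watermark policy would evict.

    Assumption: eviction starts once used/capacity reaches the high
    watermark, and then removes eviction_ratio of the current objects.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    usage = used_bytes / capacity_bytes
    if usage < high_watermark_ratio:
        return 0  # below the watermark: nothing to do
    return round(num_objects * eviction_ratio)
```

With 1000 objects, 96 of 100 bytes used, a watermark of 0.95, and an eviction fraction of 0.05, this model evicts 50 objects. Lowering the watermark starts eviction earlier, while raising the fraction frees more space per pass at the cost of more cache churn.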
High Availability#
Master Node High Availability
| Flag | Default | Description |
|---|---|---|
|  |  | Enable HA mode |
|  |  | HA backend: |
|  | empty | HA backend connection string |
|  | empty | etcd endpoints, semicolon separated (when |
|  |  | Cluster ID for HA persistence |
Caution
Metadata Snapshot And Restore is an experimental feature.
Metadata Snapshot And Restore
| Flag | Default | Description |
|---|---|---|
|  |  | Enable periodic metadata snapshot |
|  |  | Interval between snapshots |
|  |  | Timeout per snapshot child process |
|  |  | Number of recent snapshots retained |
|  | required | Object store: |
|  | empty | Catalog store: |
|  | empty | Catalog store connection string (required for |
|  | empty | Optional local backup directory |
|  |  | Restore from latest snapshot at startup |
Environment variable: MOONCAKE_SNAPSHOT_LOCAL_PATH (required when --snapshot_object_store_type=local) — persistent directory for local snapshots.
Warning
The snapshot storage path is a managed directory exclusively controlled by Mooncake. Old snapshots exceeding --snapshot_retention_count are automatically deleted. Use a dedicated directory to avoid data loss.
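The warning above is easier to appreciate with a toy model of count-based retention. The sketch below keeps the newest N entries in a directory and deletes the rest. It is not Mooncake's snapshot code: the function name, the sort-by-name ordering, and the file names used in examples are assumptions made for illustration.

```python
from pathlib import Path


def prune_snapshots(directory, retention_count):
    """Delete all but the newest retention_count files; return removed names.

    Assumption: file names sort chronologically, so the lexicographically
    largest names are the most recent snapshots.
    """
    entries = sorted(Path(directory).iterdir(), key=lambda p: p.name, reverse=True)
    removed = []
    for stale in entries[retention_count:]:
        if stale.is_file():
            stale.unlink()
            removed.append(stale.name)
    return removed
```

A policy of this shape cannot tell a snapshot from an unrelated file that happens to live in the same directory, which is why a dedicated directory is the safe choice.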
Task Manager#
| Flag | Default | Description |
|---|---|---|
|  |  | Max finished tasks kept in memory |
|  |  | Max queued pending tasks |
|  |  | Max simultaneously processing tasks |
|  |  | Timeout for pending tasks ( |
|  |  | Timeout for processing tasks ( |
|  |  | Max retries for failed tasks ( |
Offload / Tiered Storage#
Flags for controlling data movement between DRAM and SSD.
| Flag | Default | Description |
|---|---|---|
|  |  | Enable offload from DRAM to SSD |
|  |  | Defer offload to eviction time rather than at |
|  |  | Force-evict objects exceeding capacity without offload |
|  |  | Promote SSD-resident keys to DRAM on read hit |
|  |  | Min CountMinSketch count to allow promotion ( |
|  |  | Max in-flight promotion tasks |
|  |  | Storage quota in bytes |
|  |  | Enable disk eviction |
Start with --enable_offload=true for eager asynchronous SSD persistence after Put completion. Add --offload_on_evict=true when you want SSD writes to happen only when memory pressure selects an object for eviction. Add --promotion_on_hit=true to allow hot SSD-only data to be promoted back to DRAM, and tune --promotion_admission_threshold to control how many observed reads are required before promotion is queued.
CXL Memory#
| Flag | Default | Description |
|---|---|---|
|  |  | Enable CXL memory support |
|  |  | DAX device path for CXL memory |
|  |  | CXL memory size in bytes |
When --allocation_strategy=cxl is set alongside --enable_cxl=true, the master preferentially allocates new objects on CXL memory.
DFS Storage#
| Flag | Default | Description |
|---|---|---|
|  | empty | DFS mount directory for multi-layer storage backend |
|  |  | Max available space for DFS segments |
Master Configuration File#
In addition to CLI flags, the master accepts JSON/YAML config files:
mooncake_master --config_path=mooncake-store/conf/master.yaml
rpc_interface: "eth0"
rpc_port: 50051
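When master settings are produced by deployment tooling, a flat YAML file like the one above can be rendered without a YAML library. The sketch below handles only top-level scalar keys, relying on the fact that JSON scalars are also valid YAML scalars. `render_flat_yaml` is a hypothetical helper, and only the two keys shown in this section are taken from the guide.

```python
import json


def render_flat_yaml(settings: dict) -> str:
    """Render a flat mapping of scalars as YAML text (no nesting support)."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, (dict, list, tuple)):
            raise TypeError(f"{key!r}: nested values are not supported here")
        # json.dumps quotes strings and leaves numbers and booleans bare.
        lines.append(f"{key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


# Reproduces the two-line example from this section.
print(render_flat_yaml({"rpc_interface": "eth0", "rpc_port": 50051}), end="")
```

For anything beyond flat key/value pairs, a real YAML library is the safer option.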
Reference: Client & Engine Tuning (Env Vars)#
Runtime Protocol#
| Variable | Default | Description |
|---|---|---|
|  |  | RPC transport protocol between master and clients: |
|  | unset | Set to any value to enable the TENT (next-gen) transfer engine |
|  | unset | Cluster ID label attached to client metrics |
Topology Discovery#
| Variable | Default | Description |
|---|---|---|
|  |  | Auto-discover NIC/GPU topology. Set |
|  | empty | Comma-separated NIC whitelist (e.g., |
When MC_MS_AUTO_DISC=0, pass rdma_devices (comma-separated) to the Python setup() call.
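When launching roles from a script, the topology variables can be assembled into a child-process environment instead of being typed on the command line. The sketch below shows one way to do that. `rdma_env` is a hypothetical helper; the variable names `ROLE`, `MC_MS_AUTO_DISC`, and `MC_MS_FILTERS` and the device names come from the benchmark commands earlier in this guide.

```python
import os


def rdma_env(role, auto_discover=True, nic_filters=(), base_env=None):
    """Return an environment dict for launching a benchmark role.

    base_env defaults to a copy of os.environ so PATH and similar
    variables are preserved for the child process.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ROLE"] = role
    env["MC_MS_AUTO_DISC"] = "1" if auto_discover else "0"
    if nic_filters:
        # Comma-separated NIC whitelist, e.g. "mlx5_1,mlx5_2".
        env["MC_MS_FILTERS"] = ",".join(nic_filters)
    return env
```

The result can be passed as the `env` argument of `subprocess.run`, for example `subprocess.run(["python3", "mooncake-store/tests/stress_cluster_benchmark.py"], env=rdma_env("prefill", nic_filters=["mlx5_1", "mlx5_2"]))`.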
Transfer Engine Metrics (disabled by default)#
| Variable | Default | Description |
|---|---|---|
|  |  | Set to |
|  |  | Seconds between reports |
Client Metrics (enabled by default)#
| Variable | Default | Description |
|---|---|---|
|  |  | Set |
|  |  | Reporting interval; |
|  |  | Min local port for client connections |
|  |  | Max local port for client connections |
Local Hot Cache#
Local hot cache provides a DRAM read cache on top of SSD-resident objects for faster access.
| Variable | Default | Description |
|---|---|---|
|  | unset | Size of the local hot cache (e.g., |
|  | unset | Block size for hot cache (e.g., |
|  | unset | Set |
|  | unset | Minimum CountMinSketch count before a key is admitted to hot cache |
Local Memory Optimization#
| Variable | Default | Description |
|---|---|---|
|  |  | Set |
|  |  | Number of times to retry client registration on failure |
|  | unset | CXL device size (overrides |
MMap Buffer & HugePages#
| Variable | Default | Description |
|---|---|---|
|  | unset | Set |
|  |  | Supported: |
|  | unset | Pre-allocated arena pool size (e.g., |
|  | unset | Set |
yalantinglibs Log Level#
export MC_YLT_LOG_LEVEL=info
Available: trace, debug, info, warn (or warning), error, critical.