Mooncake Store Deployment & Tuning Guide#

This guide covers minimal deployment, and operational tuning of Mooncake Store.

Architecture Overview#

architecture

Master Service (mooncake_master): The central coordinator. It manages cluster membership, allocates object storage across client nodes, and enforces eviction/placement policies. Runs as a standalone process.

Client Node: Each node contributes DRAM (and optionally VRAM/SSD) to form the distributed cache pool. Clients communicate with the master over RPC for control operations (Put/Get/Remove), but transfer actual data directly between each other via the Transfer Engine — the master is never in the data path.

Metadata Service: A separate service (etcd, Redis, or HTTP) used by the Transfer Engine for peer discovery and configuration. The master’s embedded HTTP metadata server can replace an external etcd/Redis for simple deployments. We also provide a P2P handshake mechanism (P2PHANDSHAKE) that enables decentralized metadata management by storing metadata locally on each node, eliminating the need for a centralized service — this is the simplest metadata handshake method and the recommended starting point (see Quick Start).

For a detailed design discussion, see the Mooncake Store Design.


Quick Start#

Deploy a minimal single-node Mooncake Store in three steps.

1. Start the Metadata Service#

Choose one option:

# Option A: P2P handshake — the simplest metadata handshake method (recommended)
# Nothing to start here. There is no metadata service to deploy: each node
# exchanges and stores metadata locally during connection setup. Just set the
# client's metadata_server to the literal string "P2PHANDSHAKE" (see step 3).

# Option B: Embed HTTP metadata server in the master
# (configured in step 2 via --enable_http_metadata_server)

# Option C: External etcd
etcd --listen-client-urls http://0.0.0.0:2379 \
     --advertise-client-urls http://localhost:2379

Tip: P2P handshake is the easiest way to get started — it is decentralized and requires no etcd/Redis/HTTP metadata service. Prefer it for development and simple deployments; use an external etcd/Redis for large, long-lived clusters.

2. Start the Master Service#

mooncake_master \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080

On success the log shows:

Starting Mooncake Master Service
Port: 50051
Max threads: 4
Master service listening on 0.0.0.0:50051

The master’s default RPC port is 50051. The HTTP metadata server serves on port 8080 in this example.

3. Start a Store Client#

Use the Python sample program to bring up a client that contributes 3.2 GB of DRAM to the cluster:

# stress_cluster_benchmark.py
import os
from distributed_object_store import DistributedObjectStore

store = DistributedObjectStore()
store.setup(
    local_hostname=os.getenv("LOCAL_HOSTNAME", "localhost"),
    metadata_server=os.getenv("METADATA_ADDR", "http://127.0.0.1:8080/metadata"),
    global_segment_size=3200 * 1024 * 1024,  # DRAM contributed to the cluster
    local_buffer_size=512 * 1024 * 1024,     # Transfer Engine buffer
    protocol=os.getenv("PROTOCOL", "tcp"),
    device_name=os.getenv("DEVICE_NAME", ""),
    master_server_address=os.getenv("MASTER_SERVER", "127.0.0.1:50051"),
)
python3 stress_cluster_benchmark.py

To use P2P handshake instead of an HTTP/etcd metadata service, pass the literal string P2PHANDSHAKE as metadata_server — no other change is needed:

store.setup(
    local_hostname=os.getenv("LOCAL_HOSTNAME", "localhost"),
    metadata_server="P2PHANDSHAKE",          # decentralized, no metadata service
    global_segment_size=3200 * 1024 * 1024,
    local_buffer_size=512 * 1024 * 1024,
    protocol=os.getenv("PROTOCOL", "tcp"),
    device_name=os.getenv("DEVICE_NAME", ""),
    master_server_address=os.getenv("MASTER_SERVER", "127.0.0.1:50051"),
)

The standalone store service accepts the same value:

python -m mooncake.mooncake_store_service \
  --local_hostname=localhost \
  --metadata_server=P2PHANDSHAKE \
  --master_server=127.0.0.1:50051

What just happened:

  1. The client registered itself with the master via RPC.

  2. The master allocated a 3.2 GB segment on this node and added it to the cluster’s memory pool.

  3. The client is now ready to serve Put/Get/Remove requests.

You can also deploy a standalone store service process that hosts memory/SSD without an inference application:

python -m mooncake.mooncake_store_service \
  --local_hostname=localhost \
  --metadata_server=http://127.0.0.1:8080/metadata \
  --master_server=127.0.0.1:50051

Run the Stress Benchmark#

Mooncake Store includes sample programs for validating C++ and Python integrations. The stress benchmark script can be used to verify a two-role prefill/decode setup.

Configure the script for your network:

  • local_hostname: the local machine’s reachable IP address or hostname.

  • metadata_server: the Transfer Engine metadata service, for example P2PHANDSHAKE, http://127.0.0.1:8080/metadata, or an etcd address.

  • master_server_address: the Mooncake Store master address. Use IP:Port in default mode, or etcd://IP:Port;IP:Port;...;IP:Port in etcd-backed HA mode.

Then start the roles:

ROLE=prefill python3 mooncake-store/tests/stress_cluster_benchmark.py
ROLE=decode python3 mooncake-store/tests/stress_cluster_benchmark.py

For RDMA, topology auto-discovery and NIC filters can be passed through environment variables:

ROLE=prefill MC_MS_AUTO_DISC=1 MC_MS_FILTERS="mlx5_1,mlx5_2" python3 mooncake-store/tests/stress_cluster_benchmark.py
ROLE=decode MC_MS_AUTO_DISC=1 MC_MS_FILTERS="mlx5_1,mlx5_2" python3 mooncake-store/tests/stress_cluster_benchmark.py

The absence of errors indicates successful data transfer.

Standalone Real Client via RPC#

To run a resource-owning real client as a standalone RPC process:

mooncake_client \
  --global_segment_size="4GB" \
  --master_server_address="localhost:50051" \
  --metadata_server="http://localhost:8080/metadata"

The real client connects to the master and listens on port 50052 by default. Application processes such as vLLM or SGLang can then use dummy clients to forward requests to this real client.

Common mooncake_client flags:

Flag

Default

Description

--host

0.0.0.0

Client service bind host

--port

50052

Client service listen port

--global_segment_size

4GB

Global segment size contributed by the client

--master_server_address

localhost:50051

Master service address

--metadata_server

http://localhost:8080/metadata

Transfer Engine metadata service

--protocol

tcp

Transfer protocol

--device_name

empty

Transfer device name

--threads

1

Client worker thread count

Standalone Real Client via HTTP#

Use python -m mooncake.mooncake_store_service to start a real client with a lightweight HTTP API for manual Get and Put debugging.

Create a JSON config:

{
  "local_hostname": "localhost",
  "metadata_server": "http://localhost:8080/metadata",
  "global_segment_size": 268435456,
  "local_buffer_size": 268435456,
  "protocol": "tcp",
  "device_name": "",
  "master_server_address": "localhost:50051"
}

Start the service:

python -m mooncake.mooncake_store_service --config=<config_path> --port=8081

The main startup parameters are --config, the path to the JSON configuration file, and --port, the HTTP server port.

Verify Installed Examples#

For a Python integration check, run mooncake-store/tests/distributed_object_store_provider.py after starting the metadata service and mooncake_master.

For a C++ integration check, run build/mooncake-store/tests/client_integration_test after building tests and starting the required services.

Verify#

# Health check — master metrics endpoint
curl -s http://localhost:9003/metrics/summary

# List registered clients
# (exposed through the store's Python API or RPC)

Deployment Scenarios#

Single-Node (TCP) — Development / Quick Evaluation#

The simplest deployment, as shown in Quick Start. A single mooncake_master orchestrates clients over TCP. Suitable for development, testing, and single-host evaluation.

mooncake_master \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080

Limitation: the master is a single point of failure. If it crashes, cluster operations pause until it is restored.


High-Availability (etcd) — Production HA#

Runs a cluster of master instances coordinated through etcd. If the leader fails, the remaining instances elect a new leader automatically.

# Start each master instance with:
mooncake_master \
  --enable-ha=true \
  --etcd-endpoints="10.0.0.1:2379;10.0.0.2:2379;10.0.0.3:2379" \
  --rpc-address=10.0.0.1

Each instance must specify its own reachable --rpc-address. The etcd cluster used for HA can be shared with or separate from the Transfer Engine’s metadata etcd.


High-Availability (Redis) — Alternative HA Backend#

Same HA semantics but using Redis instead of etcd for leader election:

mooncake_master \
  --enable-ha=true \
  --ha_backend_type=redis \
  --ha_backend_connstring="redis://127.0.0.1:6379" \
  --rpc-address=10.0.0.1

Snapshot & Restore — Backup / Disaster Recovery#

Caution

Metadata Snapshot And Restore is experimental feature.

Periodically persist master metadata to local disk or S3, enabling recovery from a recent snapshot after a crash.

export MOONCAKE_SNAPSHOT_LOCAL_PATH=/data/mooncake_snapshots

mooncake_master \
  --enable_snapshot=true \
  --snapshot_interval_seconds=300 \
  --snapshot_retention_count=5 \
  --snapshot_object_store_type=local \
  --enable_snapshot_restore=true

Tiered Storage with SSD Offload — Cost-Effective Capacity#

Extends the cache pool from DRAM to SSD while keeping normal reads and writes on the distributed memory path. With --enable_offload=true, completed memory writes are queued for asynchronous SSD persistence through the master control plane. Set --offload_on_evict=true to defer that SSD write until the memory eviction path selects an object for reclamation. When --promotion_on_hit=true, SSD-only objects can be promoted back to DRAM after repeated reads; admission is gated by --promotion_admission_threshold.

mooncake_master \
  --enable_offload=true \
  --offload_on_evict=true \
  --promotion_on_hit=true \
  --promotion_admission_threshold=2 \
  --root_fs_dir=/mnt/ssd_cache \
  --enable_http_metadata_server=true \
  --http_metadata_server_port=8080

CXL-Aware Allocation — Memory Tiering#

When the host has CXL-attached memory, the master can preferentially allocate new objects on the CXL tier, reserving local DRAM for latency-sensitive operations.

mooncake_master \
  --enable_cxl=true \
  --cxl_path=/dev/dax0.0 \
  --cxl_size=17179869184 \
  --allocation_strategy=cxl

Container / Dynamic Network Interface#

When the master runs in a container with a dynamic IP, use --rpc_interface to resolve the RPC address from a stable interface name:

mooncake_master \
  --rpc_interface=eth0 \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080

The master resolves the current IPv4 address of eth0 at startup and uses it as the advertised RPC address.


Metrics Endpoints#

The master exposes Prometheus-style metrics on --metrics_port:

# Prometheus format
curl -s http://<master_host>:9003/metrics

# Human-readable summary
curl -s http://<master_host>:9003/metrics/summary

Quick Tips#

  • Scale --rpc_thread_num with available CPU cores and workload.

  • Start with default eviction settings; adjust --eviction_high_watermark_ratio and --eviction_ratio based on memory pressure and object churn.

  • Use /metrics/summary during bring-up; integrate /metrics with Prometheus/Grafana for production.

  • For detailed SSD offload configuration (storage backends, eviction policies, io_uring), see the SSD Offload guide.

  • For NVMe-oF SSD pool configuration see the NVMe-oF SSD Pool Deployment Guide

  • For experimental 3FS (USRBIO) integration as a persistent storage backend, see the 3FS USRBIO Plugin guide.

  • For detailed monitoring and observation see Observability


Reference: Master Startup Flags#

RPC#

Flag

Default

Description

--rpc_port

50051

RPC listen port

--rpc_thread_num

min(4, CPU cores)

RPC worker threads

--rpc_address

0.0.0.0

RPC bind address

--rpc_interface

empty

Network interface to resolve RPC address at startup (overrides --rpc_address)

--rpc_conn_timeout_seconds

0

Idle connection timeout; 0 disables

--rpc_enable_tcp_no_delay

true

Enable TCP_NODELAY

Logging#

The master uses glog. When --log_dir is set, all severities are merged into a single journal file in that directory (mooncake_master.INFO.<date>-<time>.<pid>), reachable through the stable mooncake_master.INFO symlink.

glog’s standard flags (--log_dir, --max_log_size, --logtostderr, …) control the rest.

Metrics#

Flag

Default

Description

--enable_metric_reporting

true

Periodically log master metrics

--metrics_port

9003

HTTP port for /metrics endpoints

HTTP Metadata Server (Embedded)#

Flag

Default

Description

--enable_http_metadata_server

false

Enable embedded HTTP metadata server

--http_metadata_server_host

0.0.0.0

Metadata bind host

--http_metadata_server_port

8080

Metadata TCP port

Memory Allocator#

Flag

Default

Description

--memory_allocator

offset

Memory allocator: offset (default) or cachelib

Allocation Strategy#

Flag

Default

Description

--allocation_strategy

random

random (pure random, fastest), free_ratio_first (best load balance), or cxl (prefer CXL memory)

PutStart Timeouts#

Flag

Default

Description

--put_start_discard_timeout_sec

30

Seconds before an uncompleted PutStart is discarded

--put_start_release_timeout_sec

300 (5 min)

Seconds before PutStart-allocated space is released

Eviction & TTLs#

Flag

Default

Description

--default_kv_lease_ttl

5000 ms

Lease TTL for KV objects. Supports 5000ms, 5s, 30m, 1h

--default_kv_soft_pin_ttl

1800000 ms

Soft pin TTL (30 min)

--allow_evict_soft_pinned_objects

true

Allow evicting soft-pinned objects

--eviction_ratio

0.05

Fraction evicted at high watermark

--eviction_high_watermark_ratio

0.95

Usage ratio triggering eviction

--client_ttl

10 s

Seconds before a silent client is considered disconnected

High Availability#

Master Node High Availability

Flag

Default

Description

--enable_ha

false

Enable HA mode

--ha_backend_type

etcd

HA backend: etcd, redis, or k8s

--ha_backend_connstring

empty

HA backend connection string

--etcd_endpoints

empty

etcd endpoints, semicolon separated (when --ha_backend_type=etcd)

--cluster_id

mooncake_cluster

Cluster ID for HA persistence

Caution

Metadata Snapshot And Restore is experimental feature.

Metadata Snapshot And Restore

Flag

Default

Description

--enable_snapshot

false

Enable periodic metadata snapshot

--snapshot_interval_seconds

300 (5 min)

Interval between snapshots

--snapshot_child_timeout_seconds

3600 (1 hour)

Timeout per snapshot child process

--snapshot_retention_count

10

Number of recent snapshots retained

--snapshot_object_store_type

required

Object store: local or s3

--snapshot_catalog_store_type

empty

Catalog store: embedded or redis

--snapshot_catalog_store_connstring

empty

Catalog store connection string (required for redis)

--snapshot_backup_dir

empty

Optional local backup directory

--enable_snapshot_restore

false

Restore from latest snapshot at startup

Environment variable: MOONCAKE_SNAPSHOT_LOCAL_PATH (required when --snapshot_object_store_type=local) — persistent directory for local snapshots.

Warning

The snapshot storage path is a managed directory exclusively controlled by Mooncake. Old snapshots exceeding --snapshot_retention_count are automatically deleted. Use a dedicated directory to avoid data loss.

Task Manager#

Flag

Default

Description

--max_total_finished_tasks

10000

Max finished tasks kept in memory

--max_total_pending_tasks

10000

Max queued pending tasks

--max_total_processing_tasks

10000

Max simultaneously processing tasks

--pending_task_timeout_sec

300 (5 min)

Timeout for pending tasks (0 = no timeout)

--processing_task_timeout_sec

300 (5 min)

Timeout for processing tasks (0 = no timeout)

--max_retry_attempts

10

Max retries for failed tasks (NO_AVAILABLE_HANDLE)

Offload / Tiered Storage#

Flags for controlling data movement between DRAM and SSD.

Flag

Default

Description

--enable_offload

false

Enable offload from DRAM to SSD

--offload_on_evict

false

Defer offload to eviction time rather than at Put

--offload_force_evict

false

Force-evict objects exceeding capacity without offload

--promotion_on_hit

false

Promote SSD-resident keys to DRAM on read hit

--promotion_admission_threshold

2

Min CountMinSketch count to allow promotion (1 = disable gating)

--promotion_queue_limit

50000

Max in-flight promotion tasks

--quota_bytes

0 (90% of capacity)

Storage quota in bytes

--enable_disk_eviction

true

Enable disk eviction

Start with --enable_offload=true for eager asynchronous SSD persistence after Put completion. Add --offload_on_evict=true when you want SSD writes to happen only when memory pressure selects an object for eviction. Add --promotion_on_hit=true to allow hot SSD-only data to be promoted back to DRAM, and tune --promotion_admission_threshold to control how many observed reads are required before promotion is queued.

CXL Memory#

Flag

Default

Description

--enable_cxl

false

Enable CXL memory support

--cxl_path

/dev/dax0.0

DAX device path for CXL memory

--cxl_size

8GB (8589934592)

CXL memory size in bytes

When --allocation_strategy=cxl is set alongside --enable_cxl=true, the master preferentially allocates new objects on CXL memory.

DFS Storage#

Flag

Default

Description

--root_fs_dir

empty

DFS mount directory for multi-layer storage backend

--global_file_segment_size

4GB (4294967296)

Max available space for DFS segments

Master Configuration File#

In addition to CLI flags, the master accepts JSON/YAML config files:

mooncake_master --config_path=mooncake-store/conf/master.yaml
rpc_interface: "eth0"
rpc_port: 50051

Reference: Client & Engine Tuning (Env Vars)#

Runtime Protocol#

Variable

Default

Description

MC_RPC_PROTOCOL

tcp

RPC transport protocol between master and clients: tcp or rdma

MC_USE_TENT / MC_USE_TEV1

unset

Set to any value to enable the TENT (next-gen) transfer engine

MC_STORE_CLUSTER_ID

unset

Cluster ID label attached to client metrics

Topology Discovery#

Variable

Default

Description

MC_MS_AUTO_DISC

1

Auto-discover NIC/GPU topology. Set 0 to provide rdma_devices manually

MC_MS_FILTERS

empty

Comma-separated NIC whitelist (e.g., mlx5_0,mlx5_2)

When MC_MS_AUTO_DISC=0, pass rdma_devices (comma-separated) to the Python setup() call.

Transfer Engine Metrics (disabled by default)#

Variable

Default

Description

MC_TE_METRIC

0

Set to 1 to enable engine metrics. Not supported with TENT

MC_TE_METRIC_INTERVAL_SECONDS

5

Seconds between reports

Client Metrics (enabled by default)#

Variable

Default

Description

MC_STORE_CLIENT_METRIC

1

Set 0 to disable

MC_STORE_CLIENT_METRIC_INTERVAL

0

Reporting interval; 0 collects but does not periodically report

MC_STORE_CLIENT_MIN_PORT

12300

Min local port for client connections

MC_STORE_CLIENT_MAX_PORT

14300

Max local port for client connections

Local Hot Cache#

Local hot cache provides a DRAM read cache on top of SSD-resident objects for faster access.

Variable

Default

Description

MC_STORE_LOCAL_HOT_CACHE_SIZE

unset

Size of the local hot cache (e.g., "8gb"). Set to enable the hot cache

MC_STORE_LOCAL_HOT_BLOCK_SIZE

unset

Block size for hot cache (e.g., "2mb")

MC_STORE_LOCAL_HOT_CACHE_USE_SHM

unset

Set 1 to use memfd-backed shared memory

MC_STORE_LOCAL_HOT_ADMISSION_THRESHOLD

unset

Minimum CountMinSketch count before a key is admitted to hot cache

Local Memory Optimization#

Variable

Default

Description

MC_STORE_MEMCPY

0

Set 1 to prefer local memcpy when source/destination are on the same client

MC_STORE_CLIENT_SETUP_RETRIES

20

Number of times to retry client registration on failure

MC_CXL_DEV_SIZE

unset

CXL device size (overrides --cxl_size for client-side allocation)

MMap Buffer & HugePages#

Variable

Default

Description

MC_STORE_USE_HUGEPAGE

unset

Set 1 to request HugeTLB-backed mmap()

MC_STORE_HUGEPAGE_SIZE

2MB

Supported: 2MB, 1GB

MC_MMAP_ARENA_POOL_SIZE

unset

Pre-allocated arena pool size (e.g., 8gb). Explicitly set to enable the arena

MC_DISABLE_MMAP_ARENA

unset

Set 1 to disable arena, fall back to per-call mmap()

yalantinglibs Log Level#

export MC_YLT_LOG_LEVEL=info

Available: trace, debug, info, warn (or warning), error, critical.