Mooncake Store Deployment & Operations Guide
This page summarizes useful flags, environment variables, and HTTP endpoints to help advanced users tune Mooncake Master and observe metrics.
Master Startup Flags (with defaults)
RPC Related
--rpc_port (int, default 50051): RPC listen port.
--rpc_thread_num (int, default min(4, CPU cores)): RPC worker threads. If not set, uses --max_threads (default 4) capped by CPU cores.
--rpc_address (str, default 0.0.0.0): RPC bind address.
--rpc_conn_timeout_seconds (int, default 0): RPC idle connection timeout; 0 disables.
--rpc_enable_tcp_no_delay (bool, default true): Enable TCP_NODELAY.
Metrics
--enable_metric_reporting (bool, default true): Periodically log master metrics to INFO.
--metrics_port (int, default 9003): HTTP port for the /metrics endpoints.
HTTP Metadata Server For Mooncake Transfer Engine
--enable_http_metadata_server (bool, default false): Enable embedded HTTP metadata server.
--http_metadata_server_host (str, default 0.0.0.0): Metadata bind host.
--http_metadata_server_port (int, default 8080): Metadata TCP port.
Eviction and TTLs
--default_kv_lease_ttl (uint64, default 5000 ms): Default lease TTL for KV objects.
--default_kv_soft_pin_ttl (uint64, default 1800000 ms): Soft pin TTL (30 minutes).
--allow_evict_soft_pinned_objects (bool, default true): Allow evicting soft-pinned objects.
--eviction_ratio (double, default 0.05): Fraction evicted when hitting high watermark.
--eviction_high_watermark_ratio (double, default 0.95): Usage ratio to trigger eviction.
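To make the two eviction ratios concrete, the following sketch works through the arithmetic for a hypothetical 100 GiB pool. The guide does not say what the eviction fraction is measured against (bytes or object count); this example assumes bytes of total capacity purely for illustration, and the function name eviction_plan is invented here.

```python
GIB = 1024 ** 3


def eviction_plan(capacity_bytes, high_watermark_ratio=0.95, eviction_ratio=0.05):
    """Return (trigger_bytes, evict_bytes) for a pool of the given size.

    Assumption: eviction_ratio is applied to total capacity in bytes.
    The master may measure it differently (for example, by object count).
    """
    if not 0.0 < high_watermark_ratio <= 1.0:
        raise ValueError("high_watermark_ratio must be in (0, 1]")
    if not 0.0 <= eviction_ratio <= 1.0:
        raise ValueError("eviction_ratio must be in [0, 1]")
    trigger_bytes = round(capacity_bytes * high_watermark_ratio)
    evict_bytes = round(capacity_bytes * eviction_ratio)
    return trigger_bytes, evict_bytes


# Hypothetical 100 GiB pool with the default flag values.
trigger, evict = eviction_plan(100 * GIB)
print(f"eviction starts near {trigger / GIB:.1f} GiB used")
print(f"each pass targets roughly {evict / GIB:.1f} GiB")
```

Under that assumption, lowering --eviction_high_watermark_ratio starts eviction earlier, while raising --eviction_ratio frees more per pass at the cost of dropping more cached objects.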
High Availability (optional)
--enable_ha (bool, default false): Enable HA (requires etcd).
--etcd_endpoints (str, default empty unless HA config): etcd endpoints, semicolon separated.
--client_ttl (int64, default 10 s): Client alive TTL after last ping (HA mode).
--cluster_id (str, default mooncake_cluster): Cluster ID for persistence in HA mode.
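As a rough illustration of how the HA flags combine, the sketch below assembles a mooncake_master command line from a list of etcd endpoints. The helper name ha_master_command and the endpoint addresses are hypothetical, and the host:port endpoint form is an assumption; only the flag names and the semicolon separator come from the list above.

```python
import shlex


def ha_master_command(etcd_endpoints, cluster_id="mooncake_cluster", client_ttl=10):
    """Build the argument list for an HA-enabled mooncake_master."""
    if not etcd_endpoints:
        raise ValueError("HA mode requires at least one etcd endpoint")
    return [
        "mooncake_master",
        "--enable_ha=true",
        # Endpoints are joined with semicolons, as the flag description states.
        "--etcd_endpoints=" + ";".join(etcd_endpoints),
        f"--client_ttl={client_ttl}",
        f"--cluster_id={cluster_id}",
    ]


# Hypothetical etcd addresses; replace with your own cluster's endpoints.
argv = ha_master_command(["10.0.0.1:2379", "10.0.0.2:2379"])

# shlex.join quotes the endpoint argument so the semicolon is not
# interpreted as a command separator if the line is pasted into a shell.
print(shlex.join(argv))
```

Passing the list directly to a process launcher avoids the quoting issue; when typing the command by hand, the semicolon-separated value needs to be quoted.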
Example (enable embedded HTTP metadata and metrics):
mooncake_master \
--enable_http_metadata_server=true \
--http_metadata_server_host=0.0.0.0 \
--http_metadata_server_port=8080 \
--rpc_thread_num=64 \
--metrics_port=9003 \
--enable_metric_reporting=true
Metrics Endpoints
The master exposes Prometheus-style metrics over HTTP on --metrics_port:
GET /metrics: Prometheus format (text/plain; version=0.0.4).
GET /metrics/summary: Human-readable summary.
Examples:
curl -s http://<master_host>:9003/metrics
curl -s http://<master_host>:9003/metrics/summary
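For quick scripted checks without a full Prometheus setup, the same endpoint can be read and parsed with the Python standard library. This is a minimal sketch: the function names are invented here, the parser handles only simple sample lines of the text format, and the metric names in the sample are hypothetical placeholders, not names the master is known to emit.

```python
import urllib.request


def fetch_metrics(host, port=9003, path="/metrics", timeout=5.0):
    """GET a metrics endpoint from the master and return the body as text."""
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_prometheus_text(text):
    """Map each sample (metric name plus label set) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            key, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            key, rest = parts[0], parts[1:]
        if rest:
            samples[key] = float(rest[0])  # ignore an optional timestamp
    return samples


# Hypothetical sample body; real metric names come from the master.
sample = """\
# HELP example_requests_total Illustrative counter.
# TYPE example_requests_total counter
example_requests_total 42
example_bytes{kind="used"} 1.5e+09
"""
metrics = parse_prometheus_text(sample)
print(metrics)
```

Against a live master, parse_prometheus_text(fetch_metrics("<master_host>")) would return the current samples, with <master_host> replaced by the actual address.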
Client/Engine Tuning (Env Vars, with defaults)
Topology discovery (Store Client → Transfer Engine)
MC_MS_AUTO_DISC (default 1): Auto-discover NIC/GPU topology. Set 0 to disable and provide rdma_devices manually.
MC_MS_FILTERS (default empty): Optional comma-separated NIC whitelist when auto-discovery is enabled (e.g., mlx5_0,mlx5_2).
If MC_MS_AUTO_DISC=0, pass rdma_devices (comma-separated) to the Python setup(...) call.
Transfer Engine metrics (disabled by default)
MC_TE_METRIC (default 0/unset): Set to 1 to enable periodic engine metrics logging.
MC_TE_METRIC_INTERVAL_SECONDS (default 5): Positive integer seconds between reports (effective only if metrics enabled).
Client metrics (enabled by default)
MC_STORE_CLIENT_METRIC (default 1): Client-side metrics on by default; set 0 to disable entirely.
MC_STORE_CLIENT_METRIC_INTERVAL (default 0): Reporting interval in seconds; 0 collects but does not periodically report.
Local memcpy optimization (Store transfer path)
MC_STORE_MEMCPY (default 0/false): Set to 1 to prefer local memcpy when source/destination are on the same client.
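The environment variables above can be collected in one place and applied before the client starts. The sketch below is one possible way to do that; the helper name mooncake_client_env and its parameters are hypothetical, and it assumes the variables are read when the store client and transfer engine initialize, so they must be set before the Python setup(...) call.

```python
import os


def mooncake_client_env(auto_discover=True, nic_filters=(), te_metrics=False,
                        te_metric_interval=5, client_metrics=True,
                        client_metric_interval=0, local_memcpy=False):
    """Return a dict of the MC_* variables documented above."""
    if te_metric_interval <= 0:
        raise ValueError("MC_TE_METRIC_INTERVAL_SECONDS must be a positive integer")
    if client_metric_interval < 0:
        raise ValueError("MC_STORE_CLIENT_METRIC_INTERVAL must be >= 0")
    if nic_filters and not auto_discover:
        # The whitelist applies only when auto-discovery is enabled.
        raise ValueError("nic_filters requires auto_discover=True")
    env = {
        "MC_MS_AUTO_DISC": "1" if auto_discover else "0",
        "MC_TE_METRIC": "1" if te_metrics else "0",
        "MC_TE_METRIC_INTERVAL_SECONDS": str(te_metric_interval),
        "MC_STORE_CLIENT_METRIC": "1" if client_metrics else "0",
        "MC_STORE_CLIENT_METRIC_INTERVAL": str(client_metric_interval),
        "MC_STORE_MEMCPY": "1" if local_memcpy else "0",
    }
    if nic_filters:
        env["MC_MS_FILTERS"] = ",".join(nic_filters)
    return env


# Example: restrict discovery to two NICs, log engine metrics,
# and report client metrics every 10 seconds.
env = mooncake_client_env(nic_filters=["mlx5_0", "mlx5_2"],
                          te_metrics=True, client_metric_interval=10)
os.environ.update(env)  # apply before creating the store client
```

With auto_discover=False the helper sets MC_MS_AUTO_DISC=0, in which case rdma_devices still has to be passed to setup(...) separately.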
Quick Tips
Scale --rpc_thread_num with available CPU cores and workload.
Start with default eviction settings; adjust --eviction_high_watermark_ratio and --eviction_ratio based on memory pressure and object churn.
Use /metrics/summary during bring-up; integrate /metrics with Prometheus/Grafana for production.