Mooncake Store Deployment & Operations Guide
This page summarizes useful flags, environment variables, and HTTP endpoints to help advanced users tune Mooncake Master and observe metrics.
Master Startup Flags (with defaults)
RPC Related
--rpc_port (int, default 50051): RPC listen port.
--rpc_thread_num (int, default min(4, CPU cores)): RPC worker threads. If not set, uses --max_threads (default 4) capped by CPU cores.
--rpc_address (str, default 0.0.0.0): RPC bind address.
--rpc_conn_timeout_seconds (int, default 0): RPC idle connection timeout; 0 disables.
--rpc_enable_tcp_no_delay (bool, default true): Enable TCP_NODELAY.
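The documented default for --rpc_thread_num can be reproduced with a short calculation, which may help when deciding whether to override it. This is a minimal sketch: the function name is hypothetical, and the formula comes from the flag description above rather than from the Master's source.

```python
import os


def default_rpc_threads(max_threads: int = 4) -> int:
    """Return max_threads capped by the CPU core count, as the flag text describes."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(max_threads, cores))


if __name__ == "__main__":
    print(f"cores={os.cpu_count()} default --rpc_thread_num={default_rpc_threads()}")
```

On a machine with many cores the result stays at 4, so larger deployments need an explicit --rpc_thread_num.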
Metrics
--enable_metric_reporting (bool, default true): Periodically log master metrics to INFO.
--metrics_port (int, default 9003): HTTP port for the /metrics endpoints.
HTTP Metadata Server For Mooncake Transfer Engine
--enable_http_metadata_server (bool, default false): Enable embedded HTTP metadata server.
--http_metadata_server_host (str, default 0.0.0.0): Metadata bind host.
--http_metadata_server_port (int, default 8080): Metadata TCP port.
Eviction and TTLs
--default_kv_lease_ttl (uint64, default 5000 ms): Default lease TTL for KV objects.
--default_kv_soft_pin_ttl (uint64, default 1800000 ms): Soft pin TTL (30 minutes).
--allow_evict_soft_pinned_objects (bool, default true): Allow evicting soft-pinned objects.
--eviction_ratio (double, default 0.05): Fraction evicted when hitting high watermark.
--eviction_high_watermark_ratio (double, default 0.95): Usage ratio to trigger eviction.
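To get a feel for how the two eviction ratios interact, the arithmetic can be sketched as below. This is an illustration only: the function name is hypothetical, and it assumes that --eviction_ratio is applied to total capacity, which the flag description does not state. The Master may measure the fraction differently (for example, by object count).

```python
GIB = 2**30


def eviction_plan(capacity_bytes, used_bytes,
                  high_watermark=0.95, eviction_ratio=0.05):
    """Return (triggered, bytes_to_free) under the stated assumption."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    usage = used_bytes / capacity_bytes
    triggered = usage >= high_watermark
    # ASSUMPTION: the evicted fraction is relative to total capacity.
    bytes_to_free = round(capacity_bytes * eviction_ratio) if triggered else 0
    return triggered, bytes_to_free


if __name__ == "__main__":
    triggered, to_free = eviction_plan(100 * GIB, 96 * GIB)
    print(f"triggered={triggered} free_gib={to_free / GIB:.1f}")
```

Under this reading, a lower high watermark starts eviction earlier, and a larger eviction ratio frees more per pass at the cost of dropping more cached objects.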
High Availability (optional)
--enable_ha (bool, default false): Enable HA (requires etcd).
--etcd_endpoints (str, default empty unless HA config): etcd endpoints, semicolon separated.
--client_ttl (int64, default 10 s): Client alive TTL after last ping (HA mode).
--cluster_id (str, default mooncake_cluster): Cluster ID for persistence in HA mode.
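Because --etcd_endpoints is semicolon separated rather than comma separated, it can help to assemble the HA flags programmatically. A minimal sketch follows; the helper name and the endpoint values are hypothetical, and the exact endpoint format etcd expects in your deployment is not specified here.

```python
def ha_flags(etcd_endpoints, cluster_id="mooncake_cluster", client_ttl=10):
    """Build the HA-related mooncake_master flags from a list of endpoints."""
    endpoints = [e.strip() for e in etcd_endpoints if e.strip()]
    if not endpoints:
        raise ValueError("HA mode requires at least one etcd endpoint")
    return [
        "--enable_ha=true",
        "--etcd_endpoints=" + ";".join(endpoints),  # semicolon separated
        f"--cluster_id={cluster_id}",
        f"--client_ttl={client_ttl}",
    ]


if __name__ == "__main__":
    # Placeholder endpoints; substitute the addresses of your etcd cluster.
    print(" ".join(ha_flags(["10.0.0.1:2379", "10.0.0.2:2379"])))
```

When passing the result through a shell, quote the --etcd_endpoints argument, since an unquoted semicolon ends the command.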
Example (enable embedded HTTP metadata and metrics):
mooncake_master \
--enable_http_metadata_server=true \
--http_metadata_server_host=0.0.0.0 \
--http_metadata_server_port=8080 \
--rpc_thread_num=64 \
--metrics_port=9003 \
--enable_metric_reporting=true
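The same command line can be generated from a dictionary, which is convenient when the Master is launched from a deployment script. This is a sketch: build_master_cmd is a hypothetical helper, and it assumes every flag accepts the --name=value form used in the example above.

```python
import shlex


def build_master_cmd(flags, binary="mooncake_master"):
    """Render a dict of flag names and values as an argv list."""
    cmd = [binary]
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # flags use lowercase booleans
        cmd.append(f"--{name}={value}")
    return cmd


if __name__ == "__main__":
    cmd = build_master_cmd({
        "enable_http_metadata_server": True,
        "http_metadata_server_host": "0.0.0.0",
        "http_metadata_server_port": 8080,
        "rpc_thread_num": 64,
        "metrics_port": 9003,
        "enable_metric_reporting": True,
    })
    # Print a shell-safe command; pass `cmd` to subprocess.run to launch it.
    print(shlex.join(cmd))
```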
Tips:
In addition to command-line flags, the Master also supports configuration via JSON and YAML files. For example:
mooncake_master \
--config_path=mooncake-store/conf/master.yaml
Metrics Endpoints
The master exposes Prometheus-style metrics over HTTP on --metrics_port:
GET /metrics: Prometheus format (text/plain; version=0.0.4).
GET /metrics/summary: Human-readable summary.
Examples:
curl -s http://<master_host>:9003/metrics
curl -s http://<master_host>:9003/metrics/summary
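For quick checks without a Prometheus server, the /metrics output can be fetched and parsed with the Python standard library. The sketch below is a simplified parser for the Prometheus text format (it does not handle label values containing "}"), and both function names are hypothetical. Metric names depend on the Master build, so none are assumed here.

```python
import urllib.request


def fetch_metrics(host, port=9003, path="/metrics", timeout=5.0):
    """Fetch the raw metrics text from the master's HTTP endpoint."""
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_prometheus_text(text):
    """Map each sample (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        if "{" in line:
            end = line.rfind("}")
            key, rest = line[:end + 1], line[end + 1:].split()
        else:
            parts = line.split()
            key, rest = parts[0], parts[1:]
        if rest:
            samples[key] = float(rest[0])  # an optional timestamp may follow
    return samples


if __name__ == "__main__":
    # Replace "localhost" with the master host.
    for name, value in sorted(parse_prometheus_text(fetch_metrics("localhost")).items()):
        print(name, value)
```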
Client/Engine Tuning (Env Vars, with defaults)
Topology discovery (Store Client → Transfer Engine)
MC_MS_AUTO_DISC (default 1): Auto-discover NIC/GPU topology. Set 0 to disable and provide rdma_devices manually.
MC_MS_FILTERS (default empty): Optional comma-separated NIC whitelist when auto-discovery is enabled (e.g., mlx5_0,mlx5_2).
If MC_MS_AUTO_DISC=0, pass rdma_devices (comma-separated) to the Python setup(...) call.
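The two topology modes can be captured in one helper that returns the environment variables to set and, for manual mode, the comma-separated device string. This is a sketch with a hypothetical function name. It deliberately stops short of calling setup(...), whose full signature is not given here, and it assumes the variables are read when the client initializes.

```python
def topology_env(auto_discover=True, nic_filters=(), rdma_devices=()):
    """Return (env, rdma_devices_arg) for the chosen topology mode."""
    env = {"MC_MS_AUTO_DISC": "1" if auto_discover else "0"}
    if auto_discover:
        if nic_filters:
            env["MC_MS_FILTERS"] = ",".join(nic_filters)  # NIC whitelist
        return env, None
    if not rdma_devices:
        raise ValueError("rdma_devices is required when auto-discovery is off")
    return env, ",".join(rdma_devices)


if __name__ == "__main__":
    import os

    env, devices = topology_env(auto_discover=False,
                                rdma_devices=["mlx5_0", "mlx5_2"])
    os.environ.update(env)  # set before the store client is created
    print(env, devices)     # pass `devices` as rdma_devices to setup(...)
```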
Transfer Engine metrics (disabled by default)
MC_TE_METRIC (default 0/unset): Set to 1 to enable periodic engine metrics logging.
MC_TE_METRIC_INTERVAL_SECONDS (default 5): Positive integer seconds between reports (effective only if metrics enabled).
Client metrics (enabled by default)
MC_STORE_CLIENT_METRIC (default 1): Client-side metrics on by default; set 0 to disable entirely.
MC_STORE_CLIENT_METRIC_INTERVAL (default 0): Reporting interval in seconds; 0 collects but does not periodically report.
Local memcpy optimization (Store transfer path)
MC_STORE_MEMCPY (default 0/false): Set to 1 to prefer local memcpy when source/destination are on the same client.
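The metrics and memcpy variables above can be validated and assembled in one place before a client process starts. A minimal sketch: the helper name is hypothetical, the defaults mirror the values listed above, and it assumes the variables are read at client start-up.

```python
def client_tuning_env(te_metrics=False, te_interval_s=5,
                      client_metrics=True, client_interval_s=0,
                      local_memcpy=False):
    """Build the env vars for engine metrics, client metrics, and memcpy."""
    if te_interval_s <= 0:
        raise ValueError("MC_TE_METRIC_INTERVAL_SECONDS must be a positive integer")
    if client_interval_s < 0:
        raise ValueError("MC_STORE_CLIENT_METRIC_INTERVAL must be >= 0")
    env = {
        "MC_TE_METRIC": "1" if te_metrics else "0",
        "MC_STORE_CLIENT_METRIC": "1" if client_metrics else "0",
        "MC_STORE_CLIENT_METRIC_INTERVAL": str(int(client_interval_s)),
        "MC_STORE_MEMCPY": "1" if local_memcpy else "0",
    }
    if te_metrics:
        # The interval only takes effect when engine metrics are enabled.
        env["MC_TE_METRIC_INTERVAL_SECONDS"] = str(int(te_interval_s))
    return env


if __name__ == "__main__":
    import os

    os.environ.update(client_tuning_env(te_metrics=True, te_interval_s=10,
                                        client_interval_s=30))
    print({k: v for k, v in os.environ.items() if k.startswith("MC_")})
```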
Quick Tips
Scale --rpc_thread_num with available CPU cores and workload.
Start with default eviction settings; adjust --eviction_high_watermark_ratio and --eviction_ratio based on memory pressure and object churn.
Use /metrics/summary during bring-up; integrate /metrics with Prometheus/Grafana for production.