Observability#
This document describes how to monitor and observe a running Mooncake Store deployment.
Master Metrics Log#
Once started, the Mooncake master prints a metrics summary log every 10 seconds (configurable via kMetricReportIntervalSeconds). This log provides a comprehensive snapshot of the master’s runtime state.
Log Format#
I0512 15:03:30.321475 239489 rpc_service.cpp:269] Master Admin Metrics: role=leader, state=serving, service_ready=true, master={...}, ha={...}, leader=127.0.0.1:50051, view_version=1
Each log line contains:
- GLog header: timestamp, thread ID, source file and line number
- role: HA role, either leader or standby
- state: HA runtime state, such as serving, starting, stopping, etc.
- service_ready: whether the gRPC service is accepting requests
- master: master metrics block (see below)
- ha: HA metrics block
- leader: (only when available) the current leader address and view version
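For log-based alerting, the top-level fields can be pulled out of a line with a small parser. The following is a minimal Python sketch, not part of Mooncake: `parse_admin_metrics_line` is a hypothetical helper, and it assumes the `master={...}` and `ha={...}` blocks contain no nested braces.

```python
import re

# Sample line copied from the log format above; the {...} blocks are elided there.
SAMPLE = (
    "I0512 15:03:30.321475 239489 rpc_service.cpp:269] Master Admin Metrics: "
    "role=leader, state=serving, service_ready=true, master={...}, ha={...}, "
    "leader=127.0.0.1:50051, view_version=1"
)

# GLog header: severity letter, mmdd date, time, thread ID, file:line.
GLOG_HEADER = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread_id>\d+) (?P<file>[^:\s]+):(?P<lineno>\d+)\]"
)
TOP_LEVEL_FIELD = re.compile(
    r"\b(role|state|service_ready|leader|view_version)=([^,\s]+)"
)


def parse_admin_metrics_line(line: str) -> dict:
    """Return the GLog header parts and top-level key=value fields of one line."""
    header = GLOG_HEADER.match(line)
    if header is None:
        raise ValueError("not a glog-formatted line")
    # Blank out brace-delimited blocks so keys inside master={...} or ha={...}
    # cannot be mistaken for top-level fields (assumes no nested braces).
    flat = re.sub(r"\{[^{}]*\}", "{}", line[header.end():])
    fields = dict(TOP_LEVEL_FIELD.findall(flat))
    fields["service_ready"] = fields.get("service_ready") == "true"
    return {**header.groupdict(), **fields}
```

Calling `parse_admin_metrics_line(SAMPLE)` yields a dictionary with keys such as `role`, `state`, `service_ready`, `leader`, and `view_version`; all values except `service_ready` are kept as strings.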
Master Metrics Block#
A typical master={...} block looks like this:
Mem Storage: 94.09 MB / 100.00 MB (94.1%) | SSD Storage: 0 B / 0 B | Keys: 16058 (soft-pinned: 0) | Clients: 1 | Requests (Success/Total per sec): PutStart=0.00/0.00, PutEnd=0.00/0.00, PutRevoke=0.00/0.00, Get=0.00/0.00, Exist=0.00/0.00, Del=0.00/0.00, DelAll=0.00/0.00, Ping=1.00/1.00, CopyStart=0.00/0.00, CopyEnd=0.00/0.00, CopyRevoke=0.00/0.00, MoveStart=0.00/0.00, MoveEnd=0.00/0.00, MoveRevoke=0.00/0.00, EvictDiskReplica=0.00/0.00 | Batch Requests (per sec, Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0.00/0.00/0.00, Item=0.00/0.00), PutEnd:(Req=0.00/0.00/0.00, Item=0.00/0.00), PutRevoke:(Req=0.00/0.00/0.00, Item=0.00/0.00), Get:(Req=0.00/0.00/0.00, Item=0.00/0.00), ExistKey:(Req=0.00/0.00/0.00, Item=0.00/0.00), QueryIp:(Req=0.00/0.00/0.00, Item=0.00/0.00), Clear:(Req=0.00/0.00/0.00, Item=0.00/0.00), CreateMoveTask:(Req=0.00/0.00), CreateCopyTask:(Req=0.00/0.00), QueryTask:(Req=0.00/0.00), FetchTasks:(Req=0.00/0.00), MarkTaskToComplete:(Req=0.00/0.00) | Eviction: Success/Attempts=0/0, AllocFail=0, keys=0, size=0 B | Discard: Released/Total=0/0, StagingSize=0 B | Snapshots: Success=0, Fail=0
Request counters are reported as per-second rates averaged over the time window between two consecutive log outputs (10 seconds by default). Real-time state values (storage, key count, client count, discard staging size) are not converted to rates; they reflect the current value at log time.
The metrics block consists of the following sections:
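Since the block is a single flat string, a script can split it into named sections before interpreting any of them. This is a minimal Python sketch, assuming sections are separated by ` | ` and each section name ends at the first `: `, as in the sample above. `split_sections` is a hypothetical helper, and `SAMPLE_BLOCK` is an abbreviated copy of that sample.

```python
# Abbreviated copy of the sample master={...} block shown above.
SAMPLE_BLOCK = (
    "Mem Storage: 94.09 MB / 100.00 MB (94.1%) | SSD Storage: 0 B / 0 B | "
    "Keys: 16058 (soft-pinned: 0) | Clients: 1 | "
    "Requests (Success/Total per sec): PutStart=0.00/0.00, Ping=1.00/1.00 | "
    "Eviction: Success/Attempts=0/0, AllocFail=0, keys=0, size=0 B"
)


def split_sections(block: str) -> dict[str, str]:
    """Map each section name to its raw body text."""
    sections = {}
    for part in block.split(" | "):
        # The section name is everything before the first ": ".
        name, sep, body = part.partition(": ")
        if sep:
            sections[name.strip()] = body.strip()
    return sections
```

Each body can then be handed to a section-specific parser, which keeps the parsing of one section independent of changes in another.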
Storage#
| Field | Description |
|---|---|
| Mem Storage | Current memory usage / total memory capacity, with percentage |
| SSD Storage | Current SSD-backed storage usage / total SSD capacity |
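To alert on storage pressure, the usage strings can be converted to bytes. The sketch below is illustrative only: `parse_usage` and `usage_ratio` are hypothetical helpers, and the unit table assumes binary multiples (1 MB = 1024 * 1024 bytes), which the log format does not confirm.

```python
import re

# Assumption: units are binary multiples; adjust if the master uses decimal units.
UNIT_BYTES = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
USAGE = re.compile(r"([\d.]+) (\w+) / ([\d.]+) (\w+)")


def parse_usage(text: str) -> tuple[float, float]:
    """Return (used_bytes, total_bytes) from e.g. '94.09 MB / 100.00 MB (94.1%)'."""
    match = USAGE.search(text)
    if match is None:
        raise ValueError(f"unrecognized usage string: {text!r}")
    used = float(match.group(1)) * UNIT_BYTES[match.group(2)]
    total = float(match.group(3)) * UNIT_BYTES[match.group(4)]
    return used, total


def usage_ratio(text: str):
    """Return used/total, or None when the capacity is zero (e.g. no SSD tier)."""
    used, total = parse_usage(text)
    return used / total if total > 0 else None
```

For the sample values above, the memory ratio is about 0.94, while the SSD entry returns `None` because its capacity is `0 B`.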
Keys and Clients#
| Field | Description |
|---|---|
| Keys | Total number of keys managed by the master |
| soft-pinned | Number of keys with active soft-pin leases (protected from eviction) |
| Clients | Number of currently connected clients |
Requests (Success/Total per sec)#
Rate counters for individual (non-batch) RPC requests over the last time window. Each shows <success_rate>/<total_rate> in requests per second:
| Counter | Description |
|---|---|
| PutStart | Put object allocation requests |
| PutEnd | Put object commit requests |
| PutRevoke | Put object cancellation requests |
| Get | Get replica list requests |
| Exist | Key existence check requests |
| Del | Single key deletion requests |
| DelAll | Delete-all objects requests |
| Ping | Client heartbeat/ping requests |
| CopyStart | Copy object allocation requests |
| CopyEnd | Copy object commit requests |
| CopyRevoke | Copy object cancellation requests |
| MoveStart | Move object allocation requests |
| MoveEnd | Move object commit requests |
| MoveRevoke | Move object cancellation requests |
| EvictDiskReplica | Evict disk replica requests |
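The `<success_rate>/<total_rate>` pairs lend themselves to simple failure detection. The following Python sketch parses the Requests section and estimates how many requests a rate corresponds to over one window. The helper names are hypothetical, and the `Get` values in the usage note are made up for illustration.

```python
import re

# <Name>=<success_rate>/<total_rate>, as printed in the Requests section.
RATE_PAIR = re.compile(r"(\w+)=([\d.]+)/([\d.]+)")


def parse_request_rates(body: str) -> dict[str, tuple[float, float]]:
    """Map each counter name to (success_rate, total_rate) in requests/sec."""
    return {
        name: (float(ok), float(total))
        for name, ok, total in RATE_PAIR.findall(body)
    }


def failing(rates: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the per-second failure rate of every counter with failures."""
    return {
        name: total - ok for name, (ok, total) in rates.items() if total > ok
    }


def requests_in_window(total_rate: float, window_seconds: float = 10.0) -> float:
    """Approximate request count behind a rate (10 s is the default window)."""
    return total_rate * window_seconds
```

For a hypothetical body `"PutStart=0.00/0.00, Get=4.50/5.00, Ping=1.00/1.00"`, `failing` reports only `Get`, at 0.5 failures per second, and `requests_in_window(5.0)` gives roughly 50 `Get` requests in the window.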
Batch Requests (per sec)#
Batch operations aggregate multiple items into a single RPC. Rates are per second over the last time window. Format: Req=<success>/<partial_success>/<total>, Item=<success_items>/<total_items>:
| Counter | Description |
|---|---|
| PutStart | Batch put object allocation requests |
| PutEnd | Batch put object commit requests |
| PutRevoke | Batch put object cancellation requests |
| Get | Batch get replica list requests |
| ExistKey | Batch key existence check requests |
| QueryIp | Batch query IP requests |
| Clear | Batch replica clear requests |
A request is considered “partial success” when it succeeds for some items but not all.
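Batch entries carry both request-level and item-level rates, so two derived values are often useful: the fraction of items that succeeded and the average number of items per batch. This is a minimal Python sketch with hypothetical helper names; the non-zero numbers in the usage note are invented for illustration.

```python
import re
from typing import NamedTuple, Optional


class BatchRate(NamedTuple):
    req_success: float
    req_partial: float
    req_total: float
    item_success: float
    item_total: float


# <Name>:(Req=<success>/<partial_success>/<total>, Item=<success>/<total>)
BATCH_ENTRY = re.compile(
    r"(\w+):\(Req=([\d.]+)/([\d.]+)/([\d.]+), Item=([\d.]+)/([\d.]+)\)"
)


def parse_batch_rates(body: str) -> dict[str, BatchRate]:
    """Parse batch entries; task entries with a two-part Req are skipped."""
    return {
        m.group(1): BatchRate(*(float(v) for v in m.groups()[1:]))
        for m in BATCH_ENTRY.finditer(body)
    }


def item_success_ratio(rate: BatchRate) -> Optional[float]:
    """Fraction of items that succeeded, or None when there was no traffic."""
    return rate.item_success / rate.item_total if rate.item_total > 0 else None


def items_per_request(rate: BatchRate) -> Optional[float]:
    """Average batch size in the window, or None when there was no traffic."""
    return rate.item_total / rate.req_total if rate.req_total > 0 else None
```

With a hypothetical entry `PutStart:(Req=8.00/1.00/10.00, Item=75.00/80.00)`, the item success ratio is 0.9375 and the average batch size is 8 items.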
Task Operations#
| Counter | Description |
|---|---|
| CreateMoveTask | Move task creation requests |
| CreateCopyTask | Copy task creation requests |
| QueryTask | Task status query requests |
| FetchTasks | Pending task fetch requests (polled by store clients) |
| MarkTaskToComplete | Task completion acknowledgement requests |
Eviction & Discard#
Eviction counters are deltas between two consecutive log outputs: they show what happened in the time window, not cumulative totals.
| Field | Description |
|---|---|
| Success/Attempts | Eviction rounds that succeeded at least partially vs. total attempts in this window |
| AllocFail | Number of PutStart/UpsertStart failures caused by replica allocation failure (triggers eviction) in this window |
| keys | Number of keys evicted in this window |
| size | Total size of evicted data in this window |
| Released/Total | Released (cleaned up) vs. total discarded PutStart staging replicas (live values) |
| StagingSize | Current size of discarded but not-yet-released staging buffers (live value) |
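Because the eviction counters are per-window deltas, a cumulative view has to be built by summing them across log lines. The sketch below shows one way to do that in Python. `parse_eviction` and `accumulate` are hypothetical helpers, the `size` field is ignored for brevity, and the non-zero values in the usage note are invented.

```python
import re
from collections import Counter

# Matches the integer fields of the Eviction section; the size field is skipped.
EVICTION = re.compile(
    r"Success/Attempts=(\d+)/(\d+), AllocFail=(\d+), keys=(\d+)"
)


def parse_eviction(body: str) -> Counter:
    """Return the per-window eviction deltas of one log line."""
    match = EVICTION.search(body)
    if match is None:
        raise ValueError(f"unrecognized eviction section: {body!r}")
    success, attempts, alloc_fail, keys = (int(v) for v in match.groups())
    return Counter(
        success=success, attempts=attempts, alloc_fail=alloc_fail, keys=keys
    )


def accumulate(bodies: list[str]) -> Counter:
    """Sum the deltas of consecutive windows into running totals."""
    totals = Counter()
    for body in bodies:
        totals.update(parse_eviction(body))
    return totals
```

Feeding it two windows such as `"Success/Attempts=1/2, AllocFail=3, keys=40, size=12.00 MB"` and `"Success/Attempts=0/0, AllocFail=0, keys=0, size=0 B"` gives totals of 1 success, 2 attempts, 3 allocation failures, and 40 evicted keys. The Discard values are live and should not be summed this way.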
Prometheus Metrics Endpoint#
Mooncake master exposes Prometheus-format metrics at the HTTP admin endpoint. This allows integration with Prometheus, Grafana, or any Prometheus-compatible monitoring stack.
Endpoints#
The admin HTTP server runs on metrics_port (default: 9003) and exposes the following endpoints:
| Endpoint | Content-Type | Description |
|---|---|---|
| /metrics | | All metrics in Prometheus exposition format |
| /metrics/summary | | Human-readable summary (same content as the periodic log) |
| /health | | Health check with role, HA state, and service readiness |
| | | Current HA role (leader or standby) |
| | | Current HA runtime state (serving, starting, stopping, etc.) |
Usage#
To scrape the /metrics endpoint with Prometheus, add a scrape config to your prometheus.yml:
scrape_configs:
  - job_name: 'mooncake-master'
    static_configs:
      - targets: ['<master-host>:9003']
    metrics_path: '/metrics'
Quick check with curl:
# Get Prometheus metrics
curl http://<master-host>:9003/metrics
# Get human-readable summary
curl http://<master-host>:9003/metrics/summary
# Check health
curl http://<master-host>:9003/health
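The same check can be scripted. The following Python sketch fetches an endpoint with the standard library and reads sample values out of Prometheus exposition text. `fetch_text` and `parse_exposition` are hypothetical helpers, the metric names in `SAMPLE_EXPOSITION` are invented (the names the master actually exports are not listed here), and the parser covers only plain `name value` lines, not the full format.

```python
from urllib.request import urlopen

# Invented metric names, used only to exercise the parser.
SAMPLE_EXPOSITION = """\
# HELP example_key_count Hypothetical gauge
# TYPE example_key_count gauge
example_key_count 16058
example_requests_total{op="get"} 42 1700000000000
"""


def fetch_text(url: str, timeout: float = 5.0) -> str:
    """GET a URL such as http://<master-host>:9003/metrics and return the body."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_exposition(text: str) -> dict[str, float]:
    """Map 'name' or 'name{labels}' to its value, skipping comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            name, rest = line[:end], line[end:]
        else:
            name, rest = line.split(None, 1)
        # The first token is the value; an optional timestamp may follow.
        samples[name] = float(rest.split()[0])
    return samples
```

A typical call would be `parse_exposition(fetch_text("http://<master-host>:9003/metrics"))` with the placeholder host filled in. For production use, a maintained parser such as the one in the `prometheus_client` package is likely a better choice than this hand-rolled version.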
Configuration#
The admin HTTP server is configured in the master config file (master.json or master.yaml):
{
  "enable_metric_reporting": true,
  "metrics_port": 9003,
  ...
}
Set enable_metric_reporting to false to disable the periodic metrics log. HTTP endpoints (/metrics, /health, etc.) remain available regardless of this setting.
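Tooling that needs to locate the admin endpoints can read the port from the same config file. This is a small Python sketch, assuming a JSON config and falling back to the documented default of 9003 when `metrics_port` is absent; `admin_base_url` is a hypothetical helper.

```python
import json

DEFAULT_METRICS_PORT = 9003  # Default stated for metrics_port above.


def admin_base_url(config_text: str, host: str) -> str:
    """Build the admin HTTP base URL from the text of a master.json file."""
    config = json.loads(config_text)
    port = config.get("metrics_port", DEFAULT_METRICS_PORT)
    return f"http://{host}:{port}"
```

For example, `admin_base_url('{"metrics_port": 9100}', "127.0.0.1")` returns `http://127.0.0.1:9100`. The `...` in the snippet above is an elision, so a real file must be complete, valid JSON for `json.loads` to accept it.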