Observability

Observability#

This document describes how to monitor and observe a running Mooncake Store deployment.

Master Metrics Log#

When started, the Mooncake master periodically prints a metrics summary log every 10 seconds (configurable via kMetricReportIntervalSeconds). This log provides a comprehensive snapshot of the master’s runtime state.

Log Format#

I0512 15:03:30.321475 239489 rpc_service.cpp:269] Master Admin Metrics: role=leader, state=serving, service_ready=true, master={...}, ha={...}, leader=127.0.0.1:50051, view_version=1

Each log line contains:

GLog header: timestamp, thread ID, source file and line number
role: HA role — leader or standby
state: HA runtime state — serving, starting, stopping, etc.
service_ready: whether the gRPC service is accepting requests
master: master metrics block (see below)
ha: HA metrics block
leader: (only when available) the current leader address and view version

Master Metrics Block#

A typical master={...} block looks like this:

Mem Storage: 94.09 MB / 100.00 MB (94.1%) | SSD Storage: 0 B / 0 B | Keys: 16058 (soft-pinned: 0) | Clients: 1 | Requests (Success/Total per sec): PutStart=0.00/0.00, PutEnd=0.00/0.00, PutRevoke=0.00/0.00, Get=0.00/0.00, Exist=0.00/0.00, Del=0.00/0.00, DelAll=0.00/0.00, Ping=1.00/1.00, CopyStart=0.00/0.00, CopyEnd=0.00/0.00, CopyRevoke=0.00/0.00, MoveStart=0.00/0.00, MoveEnd=0.00/0.00, MoveRevoke=0.00/0.00, EvictDiskReplica=0.00/0.00 | Batch Requests (per sec, Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0.00/0.00/0.00, Item=0.00/0.00), PutEnd:(Req=0.00/0.00/0.00, Item=0.00/0.00), PutRevoke:(Req=0.00/0.00/0.00, Item=0.00/0.00), Get:(Req=0.00/0.00/0.00, Item=0.00/0.00), ExistKey:(Req=0.00/0.00/0.00, Item=0.00/0.00), QueryIp:(Req=0.00/0.00/0.00, Item=0.00/0.00), Clear:(Req=0.00/0.00/0.00, Item=0.00/0.00), CreateMoveTask:(Req=0.00/0.00), CreateCopyTask:(Req=0.00/0.00), QueryTask:(Req=0.00/0.00), FetchTasks:(Req=0.00/0.00), MarkTaskToComplete:(Req=0.00/0.00) | Eviction: Success/Attempts=0/0, AllocFail=0, keys=0, size=0 B | Discard: Released/Total=0/0, StagingSize=0 B | Snapshots: Success=0, Fail=0

Request counters are reported as rates per second over the time window between two consecutive log outputs (10 seconds by default). Real-time state values (storage, key count, client count, discard staging size) are not rate-limited and reflect the current value at log time.

The metrics block consists of the following sections:

Storage#

Field	Description
`Mem Storage`	Current memory usage / total memory capacity, with percentage
`SSD Storage`	Current SSD-backed storage usage / total SSD capacity

Keys and Clients#

Field	Description
`Keys`	Total number of keys managed by the master
`soft-pinned`	Number of keys with active soft-pin leases (protected from eviction)
`Clients`	Number of currently connected clients

Requests (Success/Total per sec)#

Rate counters for individual (non-batch) RPC requests over the last time window. Each shows <success_rate>/<total_rate> in requests per second:

Counter	Description
`PutStart`	Put object allocation requests
`PutEnd`	Put object commit requests
`PutRevoke`	Put object cancellation requests
`Get`	Get replica list requests
`Exist`	Key existence check requests
`Del`	Single key deletion requests
`DelAll`	Delete-all objects requests
`Ping`	Client heartbeat/ping requests
`CopyStart`	Copy object allocation requests
`CopyEnd`	Copy object commit requests
`CopyRevoke`	Copy object cancellation requests
`MoveStart`	Move object allocation requests
`MoveEnd`	Move object commit requests
`MoveRevoke`	Move object cancellation requests
`EvictDiskReplica`	Evict disk replica requests

Batch Requests (per sec)#

Batch operations aggregate multiple items into a single RPC. Rates are per second over the last time window. Format: Req=<success>/<partial_success>/<total>, Item=<success_items>/<total_items>:

Counter	Description
`PutStart`	Batch put object allocation requests
`PutEnd`	Batch put object commit requests
`PutRevoke`	Batch put object cancellation requests
`Get`	Batch get replica list requests
`ExistKey`	Batch key existence check requests
`QueryIp`	Batch query IP requests
`Clear`	Batch replica clear requests

A request is considered “partial success” when it succeeds for some items but not all.

Task Operations#

Counter	Description
`CreateMoveTask`	Move task creation requests
`CreateCopyTask`	Copy task creation requests
`QueryTask`	Task status query requests
`FetchTasks`	Pending task fetch requests (polled by store clients)
`MarkTaskToComplete`	Task completion acknowledgement requests

Eviction & Discard#

Eviction counters are deltas between two consecutive log outputs — they show what happened in the time window, not cumulative totals.

Field	Description
`Eviction: Success/Attempts`	Eviction rounds that succeeded at least partially vs. total attempts in this window
`AllocFail`	Number of PutStart/UpsertStart failures caused by replica allocation failure (triggers eviction) in this window
`keys`	Number of keys evicted in this window
`size`	Total size of evicted data in this window
`Discard: Released/Total`	Released (cleaned up) vs. total discarded PutStart staging replicas (live values)
`StagingSize`	Current size of discarded but not-yet-released staging buffers (live value)

Prometheus Metrics Endpoint#

Mooncake master exposes Prometheus-format metrics at the HTTP admin endpoint. This allows integration with Prometheus, Grafana, or any Prometheus-compatible monitoring stack.

Endpoints#

The admin HTTP server runs on metrics_port (default: 9003) and exposes the following endpoints:

Endpoint	Content-Type	Description
`GET /metrics`	`text/plain; version=0.0.4`	All metrics in Prometheus exposition format
`GET /metrics/summary`	`text/plain; version=0.0.4`	Human-readable summary (same content as the periodic log)
`GET /health`	`application/json`	Health check with role, HA state, and service readiness
`GET /role`	`text/plain`	Current HA role (`leader` / `standby`)
`GET /ha_status`	`text/plain`	Current HA runtime state (`serving` / `starting` / etc.)

Usage#

Scrape the /metrics endpoint with Prometheus:

Add a scrape config to your prometheus.yml:

scrape_configs:
  - job_name: 'mooncake-master'
    static_configs:
      - targets: ['<master-host>:9003']
    metrics_path: '/metrics'

Quick check with curl:

# Get Prometheus metrics
curl http://<master-host>:9003/metrics

# Get human-readable summary
curl http://<master-host>:9003/metrics/summary

# Check health
curl http://<master-host>:9003/health

Configuration#

The admin HTTP server is configured in the master config file (master.json or master.yaml):

{
  "enable_metric_reporting": true,
  "metrics_port": 9003,
  ...
}

Set enable_metric_reporting to false to disable the periodic metrics log. HTTP endpoints (/metrics, /health, etc.) remain available regardless of this setting.