SSD Free-Ratio-First Allocation Design#
Overview#
Mooncake Store distributes KV cache objects across multiple memory segments hosted on different nodes. The master’s allocation strategy decides which segment receives each new object replica. When using DDR-only strategies such as random or free_ratio_first, the allocator ignores SSD state entirely. In deployments where some segments have SSD offload enabled and others do not, this blind allocation can concentrate traffic on a small subset of segments whose SSD capacity is quickly exhausted, while segments with ample SSD headroom remain underutilized.
This document describes SsdFreeRatioFirstAllocationStrategy, an allocation strategy that ranks candidate segments by their SSD free ratio and preferentially allocates to segments with the most available SSD space. It integrates with the existing allocation framework and adds SSD usage tracking to LocalDiskSegment so that the master can make informed placement decisions.
Architecture#
Allocation Request
│
▼
┌───────────────────────┐
│ Sample candidate │
│ segments (up to │
│ 6 * replica_num) │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ Compute SSD free │
│ ratio per candidate │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ Sort by SSD free │
│ ratio descending │
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ Allocate from top │
│ candidates │
└───────────┬───────────┘
│
┌──────────┴──────────┐
│ Remaining replicas? │
└──────┬──────┬───────┘
│ │
Yes ◄┘ └► No → Done
│
▼
┌───────────────────────┐
│ Fallback: random │
│ allocation for │
│ remaining replicas │
└───────────────────────┘
The flow follows the same high-level structure as the existing FreeRatioFirstAllocationStrategy: sample a subset of candidates, compute a ranking metric, sort, and allocate from the top. The key difference is that the ranking metric is SSD free ratio rather than DRAM free ratio.
MasterService only passes an SsdMetricsProvider when the effective allocation strategy is SSD_FREE_RATIO_FIRST. Non-SSD strategies receive nullptr, so they avoid unnecessary local-disk segment access.
Core Algorithm#
Candidate sampling#
For each allocation request requesting replica_num replicas, the strategy samples min(6 * replica_num, total_segments) candidate segments. This bounded sampling keeps the sorting cost predictable regardless of cluster size while still providing a statistically diverse candidate set.
SSD free ratio#
For each sampled segment, the SSD free ratio is computed as:
ssd_free_ratio = (ssd_total_capacity - ssd_used_bytes) / ssd_total_capacity
A segment with 1 TB total SSD capacity and 200 GB used has an SSD free ratio of 0.80. A segment whose SSD is full has a ratio of 0.0.
Before calculating the ratio, ssd_used_bytes is clamped to [0, ssd_total_capacity]. This keeps transient concurrent accounting drift from producing a negative free ratio or a value greater than 1.0. If no SSD metrics provider is available, or if the reported total capacity is not positive, the strategy treats the segment as fully free.
Sorting#
Candidates are sorted by SSD free ratio in descending order. Segments with more available SSD space appear first and are preferred for allocation.
Preferred segments#
As with other allocation strategies, segments marked as preferred by the caller are handled first. Preferred segments bypass the SSD free ratio ranking and are allocated immediately if they have sufficient capacity.
Fallback to random allocation#
After allocating from the SSD-ranked candidates, any remaining replicas that could not be satisfied are allocated using the standard random strategy as a fallback. This ensures that allocation succeeds even when SSD metrics are unavailable (for example, on segments without SSD offload configured).
SSD Usage Tracking#
ssd_used_bytes counter#
LocalDiskSegment maintains ssd_total_capacity_bytes, updated by ReportSsdCapacity, and an atomic counter ssd_used_bytes that tracks the total number of bytes currently occupied by offloaded replicas on the segment’s SSD. ssd_used_bytes is updated alongside metadata changes:
Increment:
NotifyOffloadSuccessincrementsssd_used_bytesby the object size only after the master successfully adds aLOCAL_DISKreplica to the object entry. If the object has already disappeared from metadata, the notification is ignored and the counter is not changed.Decrement: The master decrements
ssd_used_bytesviaReleaseLocalDiskUsagewhenever aLOCAL_DISKreplica is removed from metadata — on full object deletion (EraseMetadata), and on partial replica removal through the shared erase helper (EraseReplicasWithCacheTotalAccounting, used by batch clear, revoke, and stale-handle cleanup) and the local-disk eviction path.ReleaseLocalDiskUsageiterates onlyLOCAL_DISKreplicas and is a no-op for other replica types, so it is safe to call on any replica set.
The counter is atomic to allow concurrent updates from multiple RPC handler threads without requiring a separate lock. The allocation strategy treats it as an eventually consistent placement signal and clamps it before computing the free ratio.
SsdMetricsProvider interface#
ScopedLocalDiskSegmentAccess implements the SsdMetricsProvider interface, which exposes two methods:
Method |
Return type |
Description |
|---|---|---|
|
|
Total SSD capacity configured for the segment, in bytes |
|
|
Current SSD usage, read from |
The allocation strategy queries these methods through the SsdMetricsProvider interface, keeping the strategy decoupled from the concrete segment implementation.
Configuration Parameters#
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
string |
|
Set to |
The parameter is passed as a gflag to the master process at startup.
Code Structure#
File |
Change |
|---|---|
|
Add |
|
Add |
|
Add |
|
Implement |
|
Pass SSD metrics only to |
Usage Example#
Start the master with the SSD free-ratio-first strategy:
./mooncake_master --allocation_strategy=ssd_free_ratio_first
With this configuration:
The master samples up to
6 * replica_numcandidate segments for each allocation request.Candidates are ranked by SSD free ratio (descending).
Allocation proceeds from the top-ranked candidates.
Any remaining replicas fall back to random allocation.