# SSD Offload Design

## Overview
Mooncake Store supports offloading KV cache objects from distributed memory to local SSD. This extends the effective cache capacity beyond DRAM limits at lower cost, while preserving the performance characteristics of the hot path through zero-copy RDMA-based memory transfers.
SSD offload is implemented as a background subsystem within the real client process. It is transparent to the application: a Put that would otherwise be evicted from memory is persisted to disk, and a Get that finds no memory replica automatically falls back to reading from SSD.
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│ Application (vLLM, etc.)                                │
└──────────────────────────┬──────────────────────────────┘
                           │ MooncakeDistributedStore API
                           ▼
┌─────────────────────────────────────────────────────────┐
│ Real Client                                             │
│                                                         │
│ ┌──────────────────────────────────────────────────┐    │
│ │ FileStorage                                      │    │
│ │ ┌────────────┐  ┌──────────────────────────┐     │    │
│ │ │ Heartbeat  │  │ ClientBuffer (staging)   │     │    │
│ │ │ Thread     │  └──────────────────────────┘     │    │
│ │ └─────┬──────┘                                   │    │
│ │       │ offload / load                           │    │
│ │       ▼                                          │    │
│ │ ┌─────────────────────────────────────────┐      │    │
│ │ │ StorageBackendInterface                 │      │    │
│ │ │ ┌───────────┐ ┌──────────┐ ┌────────┐   │      │    │
│ │ │ │ Bucket    │ │FilePerKey│ │Offset  │   │      │    │
│ │ │ │ Backend   │ │ Backend  │ │Alloc.  │   │      │    │
│ │ │ └───────────┘ └──────────┘ └────────┘   │      │    │
│ │ └─────────────────────────────────────────┘      │    │
│ └──────────────────────────────────────────────────┘    │
│                                                         │
│ ┌──────────────────────────────────────────────────┐    │
│ │ In-memory distributed KV cache                   │    │
│ │ (Transfer Engine / RDMA)                         │    │
│ └──────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘
                           │
                   Local SSD / NVMe
```
The key components are:

- `FileStorage`: The top-level coordinator. It owns the storage backend, a staging buffer (`ClientBuffer`), and background threads for heartbeating and buffer garbage collection.
- `StorageBackendInterface`: An abstract interface implemented by three backends (see below). Responsible for the actual on-disk layout and I/O.
- Heartbeat thread: Periodically contacts the master. The master returns a list of objects to offload; the heartbeat thread writes them to SSD and notifies the master of completion.
- `ClientBuffer`: A pre-registered, O_DIRECT-aligned staging area used for zero-copy reads from SSD back into application memory.
## Data Flow

### Offload (memory → SSD)
The offload path is driven entirely by the heartbeat thread inside `FileStorage`; no application-initiated write path is involved.
```
Heartbeat Thread            Master                    Local Memory Segment
     │                         │                               │
     │─ OffloadObjectHB ──────▶│                               │
     │◀─ {key→size} map ───────│ (objects to evict from        │
     │                         │  memory to SSD)               │
     │                         │                               │
     │─ BatchQuery(keys) ───────────────────────────────────▶ │
     │◀─ {key→Slice} ───────────────────────────────────────── │
     │                         │                               │
     │ [PrepareEviction: remove old buckets, notify master via BatchEvictDiskReplica]
     │─ BatchEvictDiskReplica(evicted_keys) ──────────────────▶│ (master removes stale replicas)
     │ [FinalizeEviction: delete evicted files]
     │                         │                               │
     │ BatchOffload(slices) → StorageBackend → SSD             │
     │                         │                               │
     │─ NotifyOffloadSuccess(keys, metadata) ─────────────────▶│
     │                         │ (master adds LOCAL_DISK       │
     │                         │  replica to object entry)     │
```
Step by step:

1. **Heartbeat**: The heartbeat thread wakes up every `MOONCAKE_OFFLOAD_HEARTBEAT_INTERVAL_SECONDS` seconds and calls `client_->OffloadObjectHeartbeat(enable_offloading_, offloading_objects)`. The master replies with a map of `{key → size}` for objects it has selected to evict from memory.
2. **Read slices from memory**: `FileStorage::OffloadObjects` groups the keys into buckets (for `BucketStorageBackend`) and calls `BatchQuerySegmentSlices` to obtain `{key → Slice}` from the local memory segment via `client_->BatchQuery`.
3. **Eviction (if a capacity limit is set)**: Before writing, `PrepareEviction` removes old buckets from metadata under the exclusive lock and collects their keys. The `eviction_handler` callback calls `client_->BatchEvictDiskReplica` to notify the master in a single RPC. `FinalizeEviction` then deletes the corresponding files.
4. **Write to SSD**: `StorageBackend::BatchOffload` serializes and writes the key-value data to disk.
5. **Notify master**: On success, the `complete_handler` calls `client_->NotifyOffloadSuccess(keys, metadatas)`. The master adds a `LOCAL_DISK` replica entry (carrying the real client's RPC address as `transport_endpoint`) to the object's replica list.
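The steps above can be condensed into a small runnable model. The types below (`FakeMaster`, `FakeSegment`, a plain map standing in for the SSD) are illustrative stand-ins for the RPCs named in the diagram, not the real Mooncake API; only the order of operations is meaningful.

```cpp
#include <map>
#include <string>
#include <vector>

// Stand-in for the master: supplies offload work, records disk replicas.
struct FakeMaster {
    std::map<std::string, size_t> to_offload;   // reply to OffloadObjectHeartbeat
    std::vector<std::string> disk_replicas;     // updated by NotifyOffloadSuccess
};

// Stand-in for the local memory segment queried via BatchQuery.
struct FakeSegment {
    std::map<std::string, std::string> objects; // key -> slice contents
};

// One heartbeat iteration: master picks victims, slices are read from local
// memory, written to "disk", and the master is told about the new replicas.
static size_t HeartbeatOnce(FakeMaster& master, FakeSegment& seg,
                            std::map<std::string, std::string>& ssd) {
    size_t written = 0;
    for (const auto& [key, size] : master.to_offload) {  // 1. master's victim list
        (void)size;
        auto it = seg.objects.find(key);                 // 2. BatchQuery slice
        if (it == seg.objects.end()) continue;
        ssd[key] = it->second;                           // 3. BatchOffload to SSD
        master.disk_replicas.push_back(key);             // 4. NotifyOffloadSuccess
        ++written;
    }
    return written;
}
```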
### Load (SSD → memory)
The load path involves three parties: the requesting client, the target client that holds the SSD data, and the Transfer Engine for zero-copy data movement.
```
Requesting Client                  Target Client                 Master
     │                                  │                           │
     │─ BatchGet(keys) ──────────────────────────────────────────▶ │
     │◀─ QueryResult {replicas: [LOCAL_DISK(rpc_addr)]} ────────── │
     │                                  │                           │
     │ (no memory replica available)    │                           │
     │─ batch_get_offload_object(keys, sizes) ──────────────────▶  │
     │                                  │                           │
     │                  FileStorage::BatchGet                       │
     │                  → StorageBackend::BatchLoad                 │
     │                  → read from SSD into ClientBuffer           │
     │                                  │                           │
     │◀─ BatchGetOffloadObjectResponse ─│                           │
     │   {batch_id, pointers[], transfer_engine_addr, gc_ttl_ms}    │
     │                                  │                           │
     │─ Transfer Engine: BatchGetOffloadObject ─────────────────▶   │
     │   (RDMA/TCP: pull data from ClientBuffer into app memory)    │
     │◀─ done ──────────────────────────│                           │
     │                                  │                           │
     │─ release_offload_buffer(batch_id) ───────────────────────▶   │
     │                                  │ (free ClientBuffer slot)  │
```
Step by step:

1. **Query master**: The requesting client calls `client_->BatchGet(keys, ...)` to query the master for replica locations. If the object has been offloaded, the master returns a `LOCAL_DISK` replica descriptor containing the target client's RPC address (`transport_endpoint`).
2. **RPC to target client**: The requesting client calls `batch_get_offload_object(keys, sizes)` on the target client identified by `transport_endpoint`. The target client calls `FileStorage::BatchGet`, which allocates slots in `ClientBuffer` and reads the requested objects from SSD via `StorageBackend::BatchLoad`.
3. **Response with buffer pointers**: The target client returns a `BatchGetOffloadObjectResponse` containing `batch_id`, a list of buffer `pointers` (addresses within `ClientBuffer`), the Transfer Engine address, and `gc_ttl_ms` (the buffer lease TTL).
4. **Zero-copy transfer**: The requesting client invokes `client_->BatchGetOffloadObject(transfer_engine_addr, keys, pointers, slices)`, which uses the Transfer Engine (RDMA or TCP) to pull the data directly from the target client's `ClientBuffer` into the application's target memory (DRAM or VRAM). No intermediate copy is made on the requesting-client side.
5. **Release buffer**: After the transfer completes, the requesting client immediately calls `release_offload_buffer(batch_id)` on the target client to free the `ClientBuffer` slots. If the transfer takes longer than `gc_ttl_ms`, the buffer GC thread reclaims the slots automatically as a fallback.
## Storage Backends
### BucketStorageBackend (default)
Objects are grouped into buckets before being written to disk. Each bucket produces two files:

- `.bucket` — binary data file containing serialized key-value records
- `.meta` — metadata file describing the keys and byte offsets within the data file
Bucket IDs are monotonically increasing timestamps with a sequence suffix, so `buckets_` (a `std::map<int64_t, BucketMetadata>`) is always ordered by creation time.

Grouping strategy (`GroupOffloadingKeysByBucket`): objects are accumulated into a bucket until either `bucket_size_limit` (default 256 MB) or `bucket_keys_limit` (default 500) is reached. Objects that do not fill a complete bucket are held in `ungrouped_offloading_objects_` and retried on the next heartbeat.
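The grouping policy can be sketched as follows. The function and struct names here are illustrative (the real `GroupOffloadingKeysByBucket` operates on Mooncake's own types); the sketch shows only the seal-on-limit and hold-back-partial behavior.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Bucket {
    std::vector<std::pair<std::string, uint64_t>> objects;  // key -> size
    uint64_t total_bytes = 0;
};

// Pack objects into buckets, sealing a bucket once the byte limit or the
// key-count limit is reached. A trailing partial bucket is returned separately
// so the caller can hold it back and retry on the next heartbeat.
static void GroupByBucket(const std::vector<std::pair<std::string, uint64_t>>& objs,
                          uint64_t size_limit, size_t key_limit,
                          std::vector<Bucket>& full, Bucket& partial) {
    Bucket cur;
    for (const auto& [key, size] : objs) {
        cur.objects.emplace_back(key, size);
        cur.total_bytes += size;
        if (cur.total_bytes >= size_limit || cur.objects.size() >= key_limit) {
            full.push_back(std::move(cur));
            cur = Bucket{};
        }
    }
    partial = std::move(cur);  // may be empty; retried later if not
}
```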
In-flight read tracking: a `BucketReadGuard` RAII object increments `BucketMetadata::inflight_reads_` on construction and decrements it on destruction, so eviction can wait for concurrent reads to drain before a bucket's files are deleted.
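A minimal version of this guard, assuming a plain atomic counter (the real class may carry more state), looks like:

```cpp
#include <atomic>

struct BucketMetadata {
    std::atomic<int> inflight_reads_{0};
};

// RAII guard: the counter is nonzero exactly while a read is in progress,
// which lets eviction spin-wait until all readers of a bucket have finished.
class BucketReadGuard {
public:
    explicit BucketReadGuard(BucketMetadata& meta) : meta_(meta) {
        meta_.inflight_reads_.fetch_add(1, std::memory_order_acq_rel);
    }
    ~BucketReadGuard() {
        meta_.inflight_reads_.fetch_sub(1, std::memory_order_acq_rel);
    }
    BucketReadGuard(const BucketReadGuard&) = delete;
    BucketReadGuard& operator=(const BucketReadGuard&) = delete;

private:
    BucketMetadata& meta_;
};
```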
### StorageBackendAdaptor (FilePerKey)
Each object is stored as an individual file. The file path is derived from the key via a two-level hash-sharded directory structure to avoid large flat directories. This backend is simple and easy to inspect but does not scale well to millions of objects.
### OffsetAllocatorStorageBackend
A single pre-allocated file (`kv_cache.data`) is shared by all objects. Space within the file is managed by an `OffsetAllocator`. Metadata is sharded across 1024 independent maps to reduce lock contention under high concurrency. Records follow the layout `[key_len: u32 | value_len: u32 | key | value]`.
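The record layout can be exercised with a small encode/decode pair. The helper names are illustrative, not the backend's actual functions; the byte layout matches the description above.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Serialize one record as [key_len: u32 | value_len: u32 | key | value].
static std::vector<uint8_t> EncodeRecord(const std::string& key,
                                         const std::string& value) {
    std::vector<uint8_t> buf(8 + key.size() + value.size());
    uint32_t klen = static_cast<uint32_t>(key.size());
    uint32_t vlen = static_cast<uint32_t>(value.size());
    std::memcpy(buf.data(), &klen, 4);
    std::memcpy(buf.data() + 4, &vlen, 4);
    std::memcpy(buf.data() + 8, key.data(), klen);
    std::memcpy(buf.data() + 8 + klen, value.data(), vlen);
    return buf;
}

// Parse a record back; returns false if the buffer is truncated.
static bool DecodeRecord(const std::vector<uint8_t>& buf,
                         std::string& key, std::string& value) {
    if (buf.size() < 8) return false;
    uint32_t klen, vlen;
    std::memcpy(&klen, buf.data(), 4);
    std::memcpy(&vlen, buf.data() + 4, 4);
    if (buf.size() < 8 + static_cast<size_t>(klen) + vlen) return false;
    key.assign(reinterpret_cast<const char*>(buf.data() + 8), klen);
    value.assign(reinterpret_cast<const char*>(buf.data() + 8 + klen), vlen);
    return true;
}
```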
## Eviction (BucketStorageBackend)
When `MOONCAKE_BUCKET_MAX_TOTAL_SIZE` is set, the backend evicts existing buckets to make room before writing a new one. Eviction is disabled by default (`BucketEvictionPolicy::NONE`).
### Policies
| Policy | Candidate selection |
|---|---|
| `NONE` | Eviction disabled (the default) |
| `LRU` | The bucket with the smallest `last_access_ns_` |
`last_access_ns_` is an atomic `int64_t` updated on every `BatchLoad` with relaxed ordering. Buckets that have never been read have `last_access_ns_ == 0` and are therefore always evicted first under LRU, giving FIFO-among-unread semantics.
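A sketch of this selection rule (illustrative names; the real backend stores richer metadata): scanning the creation-ordered bucket map with a strict `<` comparison picks the least-recently-read bucket, and ties among never-read buckets resolve to the earliest-created one, which yields the FIFO-among-unread behavior.

```cpp
#include <cstdint>
#include <map>

struct BucketMeta {
    int64_t last_access_ns = 0;  // 0 means never read
    uint64_t size = 0;
};

// buckets is keyed by creation-time bucket ID, so iteration is creation order.
// Strict '<' keeps the earliest-created bucket on ties, so unread buckets
// (last_access_ns == 0) are evicted oldest-first.
static int64_t SelectLruCandidate(const std::map<int64_t, BucketMeta>& buckets) {
    int64_t victim = -1;
    int64_t oldest = INT64_MAX;
    for (const auto& [id, meta] : buckets) {
        if (meta.last_access_ns < oldest) {
            oldest = meta.last_access_ns;
            victim = id;
        }
    }
    return victim;
}
```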
### Two-phase eviction protocol
Eviction is split into two phases to ensure that the master is notified before files are deleted, and that no in-flight reads are interrupted.
Phase 1 — `PrepareEviction(required_size)` (called under the exclusive lock):

1. Repeatedly call `SelectEvictionCandidate()` until `total_size_ + required_size <= max_total_size`.
2. For each selected bucket: remove it from `buckets_` and `object_bucket_map_`, and subtract its size from `total_size_`.
3. Collect all evicted keys and bucket metadata into a `PendingEviction` struct and return it — no file I/O at this point.
Between phases — notify master:
The caller invokes the `eviction_handler` callback with the full list of evicted keys. The handler calls `MasterClient::BatchEvictDiskReplica`, which sends a single RPC to the master to remove the disk replicas for all evicted keys atomically.
Phase 2 — `FinalizeEviction(pending)` (called after master notification). For each evicted bucket:

1. Spin-wait (with a 10-second timeout) until `inflight_reads_ == 0`.
2. Evict any stale file-handle cache entries.
3. Delete the `.bucket` and `.meta` files.
This ordering guarantees:

- The master never serves a stale disk-replica location for a file that has already been deleted.
- Ongoing reads complete successfully before their files are removed.
- Freed disk space is available for the incoming write before `WriteBucket` is called.
## io_uring File I/O
When `MOONCAKE_USE_URING=true`, the storage backends replace POSIX `pread`/`pwrite` calls with an io_uring-based implementation (`UringFile`). The design prioritizes eliminating inter-thread lock contention, which was the dominant latency source in the previous global-ring approach.
### Fixed-buffer registration
The `ClientBuffer` (the staging buffer used for SSD reads) is registered with io_uring as a fixed buffer via `io_uring_register_buffers`. When a read destination falls within the registered region, the backend uses `io_uring_prep_read_fixed` instead of `io_uring_prep_read`, which avoids per-I/O page pinning and unpinning in the kernel and reduces per-request overhead.
Buffer registration is global but applied lazily per thread: `g_buf` stores the base address and length atomically; each thread-local ring calls `ensure_buf_registered()` on its first I/O and registers the buffer independently. This avoids a global synchronization barrier at startup.
To prevent io_uring's `FOLL_LONGTERM` page pinning from failing on systems with Transparent Huge Pages (THP) enabled, `MADV_NOHUGEPAGE` is applied to the buffer region before registration. This forces the kernel to back the range with 4 KB pages, making long-term pinning reliable regardless of the system THP policy.
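The pre-registration step can be sketched as below. The helper name is illustrative, and the `io_uring_register_buffers` call itself is omitted so the sketch carries no liburing dependency; only the allocation and `madvise` ordering is shown (Linux-specific).

```cpp
#include <cstdlib>
#include <sys/mman.h>

// Allocate a 4 KB-aligned region and opt it out of THP before it is handed
// to io_uring_register_buffers, so FOLL_LONGTERM pinning sees 4 KB pages.
static void* AllocRegistrableBuffer(size_t len) {
    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, len) != 0) return nullptr;
    if (madvise(buf, len, MADV_NOHUGEPAGE) != 0) {
        free(buf);
        return nullptr;
    }
    return buf;  // now safe to register as a fixed buffer
}
```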
### O_DIRECT and alignment
`UringFile` supports an optional O_DIRECT mode. When enabled:

- All file descriptors are opened with `O_DIRECT`.
- Buffers, lengths, and offsets must be aligned to 4 KB (`ALIGNMENT_ = 4096`).
- For unaligned writes (e.g., metadata serialized into a `std::string`), the backend allocates a temporary aligned bounce buffer via `posix_memalign`, copies the data, performs the aligned write, and frees the bounce buffer.
- `read_aligned` and `write_aligned` are the primary I/O paths; they assert alignment constraints and delegate directly to the ring.
### I/O operations
| Method | Description |
|---|---|
| Contiguous read/write | Contiguous read or write, chunked into SQE batches up to a fixed depth |
| `read_aligned` / `write_aligned` | Same as above, but with alignment preconditions for O_DIRECT |
| Batched multi-offset read | Submits multiple independent reads (different offsets) in SQE batches |
| Vectored (scatter/gather) I/O | One SQE per segment |
### Integration with storage backends
- `BucketStorageBackend`: uses `UringFile` for both bucket data files and metadata files when `use_uring_` is set. A file-handle cache (`file_cache_`) avoids repeated `open`/`close` for hot buckets. On eviction, the cache entry is explicitly removed before the file is deleted to prevent stale handles.
- `OffsetAllocatorStorageBackend`: opens the single pre-allocated data file with `O_DIRECT` and `UringFile`, and uses `GetFileInstance()` to expose the file handle for external buffer registration.
- `StorageBackendAdaptor` (FilePerKey): uses `UringFile` for reads when `use_uring_` is set; writes use the POSIX path.
## Metadata Recovery on Restart
On startup, `FileStorage::Init` calls `StorageBackend::ScanMeta`, which reads all on-disk metadata and invokes a callback for each discovered object. The callback calls `MasterClient::NotifyOffloadSuccess` to re-register the objects with the master, restoring the full disk-replica view without any application-level intervention.
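The recovery loop reduces to a callback-per-object scan. The sketch below uses illustrative stand-in types (a plain map for the on-disk metadata, a lambda in place of the `NotifyOffloadSuccess` call) to show the shape of the replay.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Callback fired once per object discovered in on-disk metadata.
using ScanCallback = std::function<void(const std::string& key, size_t size)>;

// Walk the recovered metadata and replay each object to the callback,
// which in the real system re-registers it with the master.
static void ScanMeta(const std::map<std::string, size_t>& on_disk,
                     const ScanCallback& cb) {
    for (const auto& [key, size] : on_disk) cb(key, size);
}
```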