# EngramStore Backend
Mooncake provides EngramStore as the storage backend for Engram embedding tables.
The scope is intentionally narrow:

- the caller defines the physical table layout
- the caller uploads one table per head
- the caller provides precomputed row ids with shape `[B, L, H]`
- Mooncake returns the selected rows as `[B, L, H, D]`
Mooncake does not implement tokenizer compression, N-gram hashing, query logic, or any other model-side Engram algorithm.
## Current Backend Boundary
The current implementation is intentionally conservative. It keeps EngramStore on top of the existing Store interfaces and does not depend on:
- transfer scatter read
- grouped transfer task
- `get_into_range`
- `batch_query`
- local direct mapping
- query cache
- remote gather control-plane changes
Those optimizations are deferred to follow-up PRs so that the EngramStore backend can land first as a small, reviewable unit.
## Configuration
EngramStoreConfig contains the physical layout for one EngramStore layer:
- `table_vocab_sizes`: per-head table sizes `[N_0, N_1, ..., N_{H-1}]`
- `embedding_dim`: row width `D`
For `layer_id`, Mooncake generates one store key per head:

```
engram:l{layer_id}:h{head_idx}
```

Each key stores a float32 table with shape `[N_h, D]`.
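As an illustration, the key scheme can be reproduced with a small helper (hypothetical; the real backend derives these keys internally):

```python
def engram_store_keys(layer_id: int, num_heads: int) -> list[str]:
    """Reproduce the per-head key scheme: engram:l{layer_id}:h{head_idx}."""
    return [f"engram:l{layer_id}:h{h}" for h in range(num_heads)]

# Example: a layer with id 3 and two heads.
engram_store_keys(3, 2)  # ["engram:l3:h0", "engram:l3:h1"]
```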
## Public Interface
Python:
- `EngramStore(layer_id, config, store=None)`
- `populate(embedding_buffers)`
- `lookup(row_ids)`
- `remove_from_store(force=False)`
- `get_table_vocab_sizes()`
- `get_store_keys()`
- `get_num_heads()`
- `get_embedding_dim()`
The Python `store` argument accepts the existing `MooncakeDistributedStore`
wrapper, or `None` for metadata-only construction.
C++:
- constructor `EngramStore(int layer_id, const EngramStoreConfig&, std::shared_ptr<PyClient>)`
- `populate(...)`
- `lookup_rows(...)`
- `lookup_rows_contiguous(...)`
- `remove_from_store(...)`
- metadata getters matching the Python surface
## Data Contract
Populate expects one NumPy float32 array per head:

```
embedding_buffers[h].shape == [N_h, D]
```
Lookup accepts either:
- nested Python lists with logical shape `[B, L, H]`, or
- a contiguous NumPy `int64` array with shape `[B, L, H]`
Lookup returns:

```
output.shape == [B, L, H, D]
```
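A pure-NumPy reference of this contract (a sketch with made-up sizes, not the backend implementation):

```python
import numpy as np

# Made-up layout: H=2 heads with per-head vocab sizes N_h, row width D.
table_vocab_sizes = [8, 16]
D = 4
tables = [np.arange(n * D, dtype=np.float32).reshape(n, D)
          for n in table_vocab_sizes]  # one [N_h, D] table per head

def reference_lookup(tables, row_ids):
    """Gather semantics: output[b, l, h] == tables[h][row_ids[b, l, h]]."""
    B, L, H = row_ids.shape
    out = np.empty((B, L, H, tables[0].shape[1]), dtype=np.float32)
    for h in range(H):
        out[:, :, h, :] = tables[h][row_ids[:, :, h]]
    return out

row_ids = np.array([[[0, 15], [7, 3]]], dtype=np.int64)  # shape [1, 2, 2]
out = reference_lookup(tables, row_ids)
assert out.shape == (1, 2, 2, D)
assert np.array_equal(out[0, 0, 1], tables[1][15])
```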
## Populate Flow
Populate follows the existing Store write path:
1. validate that exactly one table is provided for each head
2. validate that every table matches `[N_h, D]`
3. verify that the target head-table keys do not already exist
4. register each embedding table buffer
5. upload all head tables with `batch_put_from(...)`
6. unregister the staging buffers
`populate(...)` is defined as a create-only operation for one EngramStore layer. To
reuse a `layer_id`, first remove the old tables with `remove_from_store(...)`.
If the upload fails after some head tables have already been written, or if the publish succeeds but post-write buffer cleanup fails, the backend makes a best-effort attempt to remove the keys written by the failed populate attempt before returning an error.
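The create-only flow above can be sketched against a minimal in-memory stand-in for the store client. `FakeStore` and its method names are illustrative only, though `batch_put_from` mirrors the call named in the flow:

```python
import numpy as np

class FakeStore:
    """Minimal in-memory stand-in for the real store client (illustrative only)."""
    def __init__(self):
        self.objects = {}
        self.registered = set()

    def is_exist(self, key):
        return key in self.objects

    def register_buffer(self, buf):
        self.registered.add(id(buf))

    def unregister_buffer(self, buf):
        self.registered.discard(id(buf))

    def batch_put_from(self, keys, buffers):
        for key, buf in zip(keys, buffers):
            self.objects[key] = buf.tobytes()

def populate(store, layer_id, embedding_buffers, embedding_dim, table_vocab_sizes):
    # Steps 1-2: exactly one float32 [N_h, D] table per head.
    if len(embedding_buffers) != len(table_vocab_sizes):
        raise ValueError("expected one table per head")
    for buf, n in zip(embedding_buffers, table_vocab_sizes):
        if buf.dtype != np.float32 or buf.shape != (n, embedding_dim):
            raise ValueError("table shape/dtype mismatch")
    keys = [f"engram:l{layer_id}:h{h}" for h in range(len(embedding_buffers))]
    # Step 3: create-only -- the target keys must not already exist.
    if any(store.is_exist(k) for k in keys):
        raise RuntimeError("head-table key already exists; remove old tables first")
    # Steps 4-6: register staging buffers, upload all tables, then unregister.
    for buf in embedding_buffers:
        store.register_buffer(buf)
    try:
        store.batch_put_from(keys, embedding_buffers)
    finally:
        for buf in embedding_buffers:
            store.unregister_buffer(buf)
    return keys

store = FakeStore()
tables = [np.zeros((4, 2), dtype=np.float32)]
keys = populate(store, 0, tables, embedding_dim=2, table_vocab_sizes=[4])
assert keys == ["engram:l0:h0"] and store.is_exist("engram:l0:h0")
```

A second `populate` call against the same layer raises, matching the create-only contract described above.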
## Lookup Flow
Each lookup follows the same simplified backend flow:
1. validate the `row_ids` shape and bounds
2. build per-head byte ranges for the requested rows
3. issue one `get_into_ranges(...)` call to materialize those rows into the output buffer
For NumPy `row_ids`, the binding uses a contiguous fast path and builds ranges
directly from the input tensor, without first converting the entire input into a
nested C++ container.
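A sketch of the range-building step, assuming float32 rows so each row occupies `D * 4` bytes inside its head table; the function name is hypothetical, while the key pattern follows the scheme described earlier:

```python
import numpy as np

ITEM_SIZE = 4  # float32 bytes per element

def build_row_ranges(row_ids, embedding_dim, layer_id):
    """For each (b, l, h) entry, the requested row occupies bytes
    [row_id * D * 4, (row_id + 1) * D * 4) within the head-h table object."""
    B, L, H = row_ids.shape
    row_bytes = embedding_dim * ITEM_SIZE
    ranges = []
    for h in range(H):
        key = f"engram:l{layer_id}:h{h}"
        for offset in row_ids[:, :, h].ravel() * row_bytes:
            ranges.append((key, int(offset), row_bytes))
    return ranges

row_ids = np.array([[[2, 0]]], dtype=np.int64)  # [B=1, L=1, H=2]
ranges = build_row_ranges(row_ids, embedding_dim=4, layer_id=0)
assert ranges == [("engram:l0:h0", 32, 16), ("engram:l0:h1", 0, 16)]
```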
## Validation
The backend enforces these invariants:
- `table_vocab_sizes` is non-empty and every entry is positive
- `embedding_dim` is positive
- `populate(...)` receives exactly one table per head
- every populated table matches `[N_h, D]`
- `lookup(...)` receives a non-empty `[B, L, H]` input
- every row id satisfies `0 <= row_ids[..., h] < N_h`
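The lookup-side invariants can be sketched as a standalone check (illustrative, not the backend's actual code):

```python
import numpy as np

def validate_lookup(row_ids, table_vocab_sizes):
    """Enforce the lookup invariants: non-empty [B, L, H] input and
    0 <= row_ids[..., h] < N_h for every head h."""
    if row_ids.ndim != 3 or row_ids.size == 0:
        raise ValueError("row_ids must be a non-empty [B, L, H] array")
    H = row_ids.shape[2]
    if H != len(table_vocab_sizes):
        raise ValueError(f"expected {len(table_vocab_sizes)} heads, got {H}")
    for h, n in enumerate(table_vocab_sizes):
        ids = row_ids[:, :, h]
        if ids.min() < 0 or ids.max() >= n:
            raise ValueError(f"row id out of range for head {h}: need 0 <= id < {n}")

# In-range ids pass silently; an out-of-range id raises ValueError.
validate_lookup(np.zeros((1, 2, 2), dtype=np.int64), [4, 4])
```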
## Validation Status
This backend is covered by:
- correctness tests in `scripts/test_engram_store.py`
- benchmark coverage in `scripts/bench_engram_store_27b.py`
`scripts/test_engram_store.py` can run against an existing Mooncake deployment through
`MOONCAKE_CONFIG_PATH` / `MOONCAKE_MASTER`, or it can start a local
`mooncake_master` instance automatically for a self-contained TCP test run.
By default, the benchmark exercises `engram_store.populate(...)` directly. Its
fallback populate paths are gated behind `ENGRAM_ALLOW_POPULATE_FALLBACK=1` so
they do not silently mask regressions in the current implementation.