# EngramStore Backend
Mooncake provides EngramStore as the storage backend for Engram embedding tables.
The scope is intentionally narrow:

- the caller defines the physical table layout
- the caller uploads one table per head
- the caller provides precomputed row ids with shape `[B, L, H]`
- Mooncake returns the selected rows as `[B, L, H, D]`
Mooncake does not implement tokenizer compression, N-gram hashing, query logic, or any other model-side Engram algorithm.
## Current Backend Boundary
The current implementation is intentionally conservative. It keeps EngramStore on top of the existing Store interfaces and does not depend on:
- transfer scatter read
- grouped transfer task
- `get_into_range`
- `batch_query`
- local direct mapping
- query cache
- remote gather control-plane changes
Those optimizations are deferred to follow-up PRs so that the EngramStore backend can land first as a small, reviewable unit.
## Configuration
EngramStoreConfig contains the physical layout for one EngramStore layer:
- `table_vocab_sizes`: per-head table sizes `[N_0, N_1, ..., N_{H-1}]`
- `embedding_dim`: row width `D`
For `layer_id`, Mooncake generates one store key per head:

```
engram:l{layer_id}:h{head_idx}
```

Each key stores a float32 table with shape `[N_h, D]`.
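As an illustration, the key scheme can be reproduced with a small helper (hypothetical; the real backend derives these keys internally):

```python
def engram_store_keys(layer_id: int, num_heads: int) -> list[str]:
    """Reproduce the per-head key scheme: engram:l{layer_id}:h{head_idx}."""
    return [f"engram:l{layer_id}:h{h}" for h in range(num_heads)]

# Example: a layer with id 3 and two heads.
engram_store_keys(3, 2)  # ["engram:l3:h0", "engram:l3:h1"]
```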
## Public Interface
Python:
- `EngramStore(layer_id, config, store=None)`
- `populate(embedding_buffers)`
- `lookup(row_ids)`
- `remove_from_store(force=False)`
- `get_table_vocab_sizes()`
- `get_store_keys()`
- `get_num_heads()`
- `get_embedding_dim()`
The Python `store` argument accepts the existing `MooncakeDistributedStore`
wrapper, or `None` for metadata-only construction.
C++:
- constructor `EngramStore(int layer_id, const EngramStoreConfig&, std::shared_ptr<PyClient>)`
- `populate(...)`
- `lookup_rows(...)`
- `lookup_rows_contiguous(...)`
- `remove_from_store(...)`
- metadata getters matching the Python surface
## Data Contract
Populate expects one NumPy float32 array per head:

```
embedding_buffers[h].shape == [N_h, D]
```
Lookup accepts either:
- nested Python lists with logical shape `[B, L, H]`, or
- a contiguous NumPy `int64` array with shape `[B, L, H]`
Lookup returns:

```
output.shape == [B, L, H, D]
```
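A pure-NumPy reference of this contract (a sketch with made-up sizes, not the backend implementation):

```python
import numpy as np

# Made-up layout: H=2 heads with per-head vocab sizes N_h, row width D.
table_vocab_sizes = [8, 16]
D = 4
tables = [np.arange(n * D, dtype=np.float32).reshape(n, D)
          for n in table_vocab_sizes]  # one [N_h, D] table per head

def reference_lookup(tables, row_ids):
    """Gather semantics: output[b, l, h] == tables[h][row_ids[b, l, h]]."""
    B, L, H = row_ids.shape
    out = np.empty((B, L, H, tables[0].shape[1]), dtype=np.float32)
    for h in range(H):
        out[:, :, h, :] = tables[h][row_ids[:, :, h]]
    return out

row_ids = np.array([[[0, 15], [7, 3]]], dtype=np.int64)  # shape [1, 2, 2]
out = reference_lookup(tables, row_ids)
assert out.shape == (1, 2, 2, D)
assert np.array_equal(out[0, 0, 1], tables[1][15])
```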
## Populate Flow
Populate follows the existing Store write path:
1. validate that exactly one table is provided for each head
2. validate that every table matches `[N_h, D]`
3. verify that the target head-table keys do not already exist
4. register each embedding table buffer
5. upload all head tables with `batch_put_from(...)`
6. unregister the staging buffers
`populate(...)` is defined as a create-only operation for one EngramStore layer. To
reuse a `layer_id`, first remove the old tables with `remove_from_store(...)`.
If the upload fails after some head tables have already been written, or if the publish succeeds but post-write buffer cleanup fails, the backend makes a best-effort attempt to remove the keys written by the failed populate attempt before returning an error.
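The create-only flow above can be sketched against a minimal in-memory stand-in for the store client. `FakeStore` and its method names are illustrative only, though `batch_put_from` mirrors the call named in the flow:

```python
import numpy as np

class FakeStore:
    """Minimal in-memory stand-in for the real store client (illustrative only)."""
    def __init__(self):
        self.objects = {}
        self.registered = set()

    def is_exist(self, key):
        return key in self.objects

    def register_buffer(self, buf):
        self.registered.add(id(buf))

    def unregister_buffer(self, buf):
        self.registered.discard(id(buf))

    def batch_put_from(self, keys, buffers):
        for key, buf in zip(keys, buffers):
            self.objects[key] = buf.tobytes()

def populate(store, layer_id, embedding_buffers, embedding_dim, table_vocab_sizes):
    # Steps 1-2: exactly one float32 [N_h, D] table per head.
    if len(embedding_buffers) != len(table_vocab_sizes):
        raise ValueError("expected one table per head")
    for buf, n in zip(embedding_buffers, table_vocab_sizes):
        if buf.dtype != np.float32 or buf.shape != (n, embedding_dim):
            raise ValueError("table shape/dtype mismatch")
    keys = [f"engram:l{layer_id}:h{h}" for h in range(len(embedding_buffers))]
    # Step 3: create-only -- the target keys must not already exist.
    if any(store.is_exist(k) for k in keys):
        raise RuntimeError("head-table key already exists; remove old tables first")
    # Steps 4-6: register staging buffers, upload all tables, then unregister.
    for buf in embedding_buffers:
        store.register_buffer(buf)
    try:
        store.batch_put_from(keys, embedding_buffers)
    finally:
        for buf in embedding_buffers:
            store.unregister_buffer(buf)
    return keys

store = FakeStore()
tables = [np.zeros((4, 2), dtype=np.float32)]
keys = populate(store, 0, tables, embedding_dim=2, table_vocab_sizes=[4])
assert keys == ["engram:l0:h0"] and store.is_exist("engram:l0:h0")
```

A second `populate` call against the same layer raises, matching the create-only contract described above.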
## Lookup Flow
Each lookup follows the same simplified backend flow:
1. validate the `row_ids` shape and bounds
2. build per-head byte ranges for the requested rows
3. issue one `get_into_ranges(...)` call to materialize those rows into the output buffer
For NumPy `row_ids`, the binding uses a contiguous fast path and builds ranges
directly from the input tensor, without first converting the entire input into a
nested C++ container.
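A sketch of the range-building step, assuming float32 rows so each row occupies `D * 4` bytes inside its head table; the function name is hypothetical, while the key pattern follows the scheme described earlier:

```python
import numpy as np

ITEM_SIZE = 4  # float32 bytes per element

def build_row_ranges(row_ids, embedding_dim, layer_id):
    """For each (b, l, h) entry, the requested row occupies bytes
    [row_id * D * 4, (row_id + 1) * D * 4) within the head-h table object."""
    B, L, H = row_ids.shape
    row_bytes = embedding_dim * ITEM_SIZE
    ranges = []
    for h in range(H):
        key = f"engram:l{layer_id}:h{h}"
        for offset in row_ids[:, :, h].ravel() * row_bytes:
            ranges.append((key, int(offset), row_bytes))
    return ranges

row_ids = np.array([[[2, 0]]], dtype=np.int64)  # [B=1, L=1, H=2]
ranges = build_row_ranges(row_ids, embedding_dim=4, layer_id=0)
assert ranges == [("engram:l0:h0", 32, 16), ("engram:l0:h1", 0, 16)]
```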
## Validation
The backend enforces these invariants:
- `table_vocab_sizes` is non-empty and every entry is positive
- `embedding_dim` is positive
- `populate(...)` receives exactly one table per head
- every populated table matches `[N_h, D]`
- `lookup(...)` receives a non-empty `[B, L, H]` input
- every row id satisfies `0 <= row_ids[..., h] < N_h`
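The lookup-side invariants can be sketched as a standalone check (illustrative, not the backend's actual code):

```python
import numpy as np

def validate_lookup(row_ids, table_vocab_sizes):
    """Enforce the lookup invariants: non-empty [B, L, H] input and
    0 <= row_ids[..., h] < N_h for every head h."""
    if row_ids.ndim != 3 or row_ids.size == 0:
        raise ValueError("row_ids must be a non-empty [B, L, H] array")
    H = row_ids.shape[2]
    if H != len(table_vocab_sizes):
        raise ValueError(f"expected {len(table_vocab_sizes)} heads, got {H}")
    for h, n in enumerate(table_vocab_sizes):
        ids = row_ids[:, :, h]
        if ids.min() < 0 or ids.max() >= n:
            raise ValueError(f"row id out of range for head {h}: need 0 <= id < {n}")

# In-range ids pass silently; an out-of-range id raises ValueError.
validate_lookup(np.zeros((1, 2, 2), dtype=np.int64), [4, 4])
```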
## Validation Status
This backend is covered by:
- correctness tests in `scripts/test_engram_store.py`
- benchmark coverage in `scripts/bench_engram_store_27b.py`
`scripts/test_engram_store.py` can run against an existing Mooncake deployment through
`MOONCAKE_CONFIG_PATH` / `MOONCAKE_MASTER`, or it can start a local
`mooncake_master` instance automatically for a self-contained TCP test run.
By default, the benchmark exercises `engram_store.populate(...)` directly. Its
fallback populate paths are gated behind `ENGRAM_ALLOW_POPULATE_FALLBACK=1` so
they do not silently mask regressions in the current implementation.