Mooncake Conductor Architecture
Overview
Mooncake Conductor is the KV cache indexer used by cache-aware routers. It subscribes to KV cache events from inference engines or storage backends, normalizes those events, maintains a global prefix cache table, and exposes HTTP APIs for dynamic service registration and cache-hit queries.
The design goal is to let routers answer a simple scheduling question: for this request prefix, which registered instance has the best reusable KV cache locality, and on which cache tiers is that prefix available?
Architecture
flowchart TD
subgraph Router["Gateway or router"]
HTTPClient["HTTP client"]
end
subgraph Conductor["Mooncake Conductor"]
EventManager["EventManager"]
ZMQClient["ZMQClient"]
KVEventHandler["KVEventHandler"]
PrefixCacheTable["PrefixCacheTable"]
end
subgraph Publishers["vLLM, SGLang, Mooncake, or cache daemons"]
InferenceService["Inference or cache service"]
KVEvent["KV event stream"]
end
EventManager -->|creates and manages| ZMQClient
ZMQClient -->|registers handler| KVEventHandler
InferenceService -->|publishes events| KVEvent
KVEvent -->|ZMQ SUB or DEALER| ZMQClient
ZMQClient -->|decoded batches| KVEventHandler
KVEventHandler -->|normalizes events| PrefixCacheTable
HTTPClient -->|/query, /register, /unregister| EventManager
EventManager -->|cache-hit computation| PrefixCacheTable
PrefixCacheTable -->|hit result by instance| EventManager
EventManager -->|HTTP response| HTTPClient
Components
| Component | Responsibility |
|---|---|
| EventManager | Owns the Conductor lifecycle, HTTP server, dynamic registration, active service map, and tenant-to-instance map. |
| ZMQClient | Connects to publisher endpoints, consumes event frames, decodes event batches, tracks sequence numbers, and requests replay after reconnects. |
| KVEventHandler | Adapts decoded engine events into Conductor store/remove events enriched with registration metadata. |
| PrefixCacheTable | Maintains model-context-specific prefix maps, engine-hash to conductor-hash mappings, medium metadata, DP-rank metadata, and query-time hit computation. |
Data model
The prefix index is scoped by ModelContext:
(tenant_id, model_name, lora_name, block_size, additional_salt, instance_id)
Within each context, Conductor stores:
a mapping from engine-provided block hash to Conductor prefix hash;
a prefix hash map that records replica count, medium set, DP-rank set, and per-instance access metadata;
a DP-rank set used to report rank-level hit information.
The current implementation computes complete-block prefix hashes from token IDs
and ignores trailing partial blocks during /query.
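The complete-block rule can be illustrated with a minimal Go sketch. It assumes a chained hash in which each block hash covers the previous block hash plus the current block's token IDs. The function name `prefixHashes` and the choice of FNV-1a are illustrative assumptions; this document does not specify the hash function Conductor uses or how the additional salt is mixed in.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// prefixHashes returns one chained hash per complete block of tokenIDs.
// Each hash covers the previous block's hash plus the current block's
// tokens, so equal hashes imply equal prefixes (up to collisions).
// FNV-1a is a stand-in: the real hash function is not given in this document.
func prefixHashes(tokenIDs []uint32, blockSize int) []uint64 {
	if blockSize <= 0 {
		return nil
	}
	numBlocks := len(tokenIDs) / blockSize // trailing partial block is dropped
	hashes := make([]uint64, 0, numBlocks)
	var parent uint64
	buf := make([]byte, 8)
	for b := 0; b < numBlocks; b++ {
		h := fnv.New64a()
		binary.LittleEndian.PutUint64(buf, parent)
		h.Write(buf)
		for _, tok := range tokenIDs[b*blockSize : (b+1)*blockSize] {
			binary.LittleEndian.PutUint32(buf[:4], tok)
			h.Write(buf[:4])
		}
		parent = h.Sum64()
		hashes = append(hashes, parent)
	}
	return hashes
}

func main() {
	tokens := []uint32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	hashes := prefixHashes(tokens, 4)
	fmt.Println(len(hashes)) // 2 complete blocks; tokens 9 and 10 are ignored
}
```

Because the hashes are chained, two requests that share their first block but differ in the second produce the same first hash and different second hashes, which is what lets a lookup stop at the first miss.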
Event flow
1. A service is registered statically from conductor_config.json or dynamically through POST /register.
2. EventManager creates one ZMQClient per (instance_id, tenant_id, dp_rank) service key.
3. ZMQClient subscribes to the publisher endpoint and consumes frames in the form [topic, sequence, payload].
4. The payload is decoded into a batch of engine events. Today, the implemented parser supports vLLM BlockStored and BlockRemoved msgpack events.
5. KVEventHandler enriches events with registration metadata such as model, LoRA, tenant, instance, block size, and additional salt.
6. PrefixCacheTable updates the prefix map for stored or removed blocks.
7. If a reconnect detects missed sequence numbers, the replay endpoint can be used to request missed events.
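The sequence tracking in the flow above can be sketched as a small gap detector. This is an illustration under stated assumptions, not Conductor's actual ZMQClient code: the type `seqTracker`, the helper `frame`, and the topic string are hypothetical, and the sequence frame is assumed to be an 8-byte big-endian unsigned integer.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// seqTracker remembers the last sequence number seen from one publisher.
type seqTracker struct {
	last    uint64
	started bool
}

// observe parses a [topic, sequence, payload] message and reports how many
// events were missed since the previous message (0 when contiguous).
// Assumption: the sequence frame is an 8-byte big-endian unsigned integer.
func (t *seqTracker) observe(frames [][]byte) (missed uint64, payload []byte, err error) {
	if len(frames) != 3 {
		return 0, nil, errors.New("expected [topic, sequence, payload]")
	}
	if len(frames[1]) != 8 {
		return 0, nil, errors.New("sequence frame must be 8 bytes")
	}
	seq := binary.BigEndian.Uint64(frames[1])
	if t.started && seq <= t.last {
		return 0, frames[2], nil // duplicate or replayed message, not a gap
	}
	if t.started {
		missed = seq - t.last - 1
	}
	t.last, t.started = seq, true
	return missed, frames[2], nil
}

// frame builds a test message; the topic name is a placeholder.
func frame(seq uint64) [][]byte {
	s := make([]byte, 8)
	binary.BigEndian.PutUint64(s, seq)
	return [][]byte{[]byte("kv-events"), s, []byte("payload")}
}

func main() {
	var t seqTracker
	for _, seq := range []uint64{0, 1, 4} {
		missed, _, err := t.observe(frame(seq))
		if err != nil {
			panic(err)
		}
		fmt.Printf("seq=%d missed=%d\n", seq, missed)
	}
}
```

A non-zero `missed` value is the point at which a client would ask the replay endpoint for the skipped range before applying newer events.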
Query flow
1. A router obtains prompt token IDs, usually from an engine tokenizer endpoint.
2. The router calls POST /query with model, token_ids, block_size, and optional tenant_id, instance_id, lora_name, and cache_salt.
3. Conductor computes complete-block prefix hashes for the request.
4. The prefix table is scanned in order. The first miss terminates the scan so prefix continuity is preserved.
5. Conductor returns per-instance longest_matched, medium hit counts, and DP-rank hit counts.
6. The router selects the best target instance and forwards the request.
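The in-order scan with first-miss termination can be sketched as follows. This is a simplified interpretation, not the PrefixCacheTable implementation: the type `prefixTable` and the function `longestMatched` are hypothetical names, the table tracks only which instances hold each block, and medium and DP-rank counts are omitted.

```go
package main

import "fmt"

// prefixTable maps a prefix hash to the set of instances holding that block.
// It is a simplified stand-in for PrefixCacheTable.
type prefixTable map[uint64]map[string]bool

// longestMatched returns, per instance, how many leading blocks of the
// request are cached. An instance stops counting at its first miss, so the
// result is always a contiguous prefix.
func longestMatched(table prefixTable, requestHashes []uint64) map[string]int {
	matched := make(map[string]int)
	for i, h := range requestHashes {
		progressed := false
		for inst := range table[h] {
			if matched[inst] == i { // instance hit every earlier block
				matched[inst] = i + 1
				progressed = true
			}
		}
		if !progressed {
			break // no instance extends its prefix; later blocks cannot count
		}
	}
	return matched
}

func main() {
	table := prefixTable{
		101: {"vllm-a": true, "vllm-b": true},
		102: {"vllm-a": true},
		103: {"vllm-a": true, "vllm-b": true},
	}
	result := longestMatched(table, []uint64{101, 102, 103, 104})
	fmt.Println(result["vllm-a"], result["vllm-b"]) // 3 1
}
```

In this example `vllm-b` holds blocks 101 and 103 but not 102, so its match length stays at 1: a later hit does not count once the prefix is broken. A router would then prefer `vllm-a`.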
Dynamic registration
Conductor supports runtime registration so routers or control planes can add and remove KV event publishers without restarting the process.
{
"endpoint": "tcp://127.0.0.1:5557",
"replay_endpoint": "tcp://127.0.0.1:5558",
"type": "vLLM",
"modelname": "qwen2.5",
"lora_name": "",
"tenant_id": "default",
"instance_id": "vllm-prefill-node1",
"block_size": 128,
"dp_rank": 0,
"additionalsalt": ""
}
Static configuration uses the same fields under kvevent_instance:
{
"http_server_port": 13333,
"kvevent_instance": {
"vllm-prefill-node1": {
"endpoint": "tcp://127.0.0.1:5557",
"replay_endpoint": "tcp://127.0.0.1:5558",
"type": "vLLM",
"modelname": "qwen2.5",
"lora_name": "",
"tenant_id": "default",
"instance_id": "vllm-prefill-node1",
"block_size": 128,
"dp_rank": 0,
"additionalsalt": ""
}
}
}
Environment variables
| Variable | Default | Description |
|---|---|---|
| CONDUCTOR_LOG_LEVEL | | Log level. |
| CONDUCTOR_CONFIG_PATH | | Path to the static configuration file. |
| | random | Legacy seed option for hash computation experiments. |
Build and run
cd mooncake-conductor/conductor-ctrl
go mod tidy
go build -o mooncake_conductor .
export CONDUCTOR_CONFIG_PATH=../example/conductor_config.json
export CONDUCTOR_LOG_LEVEL=INFO
./mooncake_conductor
Project structure
mooncake-conductor/
+-- conductor-ctrl/
| +-- common/ # shared types, helpers, and sync map
| +-- kvevent/ # EventManager and KVEventHandler
| +-- prefixindex/ # prefix cache table and hit computation
| +-- zmq/ # ZMQ client, event decoding, event types
| +-- main.go # process entry point
+-- example/ # demo config and cache-aware proxy
+-- build.sh
+-- CMakeLists.txt
See Indexer API for the HTTP API and KV Events wire format.