Mooncake Conductor Indexer API#
Overview#
Mooncake Conductor is a KV cache indexer used by routers and gateways to make cache-aware scheduling decisions. It consumes KV cache events from inference engines or storage backends, maintains prefix-hit metadata across cache tiers, and exposes HTTP APIs for service registration and cache-hit queries.
This document incorporates the latest API direction from:
The Conductor can serve multiple model groups in one process. Each query is scoped by model identity, block size, LoRA identity, tenant isolation, and the registered instance that can receive traffic.
Concepts#
Storage tiers#
The indexer tracks KV cache availability across three logical tiers:
G1, Device Pool: Device-resident KV blocks, such as GPU, NPU, HBM, or other accelerator memory owned by inference engines.
G2, Host Pool: CPU or host DRAM KV blocks, including Mooncake registered memory pools.
G3, Disk Pool: SSD, 3FS, DFS, NFS, or other disk-backed KV storage.
The medium field identifies the concrete tier or device type. Common values
are gpu, cpu, and disk. Engines may add other values as new media are
supported.
Identity dimensions#
KV cache hits are interpreted under the following dimensions:
| Dimension | Description |
|---|---|
| model_name | Model identifier. KV blocks from different models are incompatible. |
| block_size | Number of tokens per KV block. Different block sizes produce different token-to-block mappings. |
| additional_salt | Opaque salt used to separate hash namespaces for quantization, model revision, tenant isolation, or other deployment-specific dimensions. |
| cache_salt | Ensures that cached data blocks are kept separate for different customers. |
| lora_name | LoRA adapter name. Empty or null means base model. |
| tenant_id | Upstream tenant or customer identity. Used for isolation and to keep query output bounded. |
| instance_id | Routable API server or engine instance returned by the Indexer API. Routers use this value as the scheduling target. |
| backend_id | KV Events identity for the entity that owns the KV blocks. It may be an inference worker, a Mooncake storage daemon, or another cache backend. |
| medium | Cache medium where the blocks are present. |
| dp_rank | Data-parallel rank that owns or can serve the blocks. |
instance_id and backend_id intentionally have different meanings.
instance_id is the router-facing target in the Indexer API. backend_id is
the event-facing cache owner in the KV Events API. In deployments where cache
storage is decoupled from inference workers, backend_id can identify a cache
daemon while instance_id still identifies the engine endpoint that receives
requests.
Hashing standard#
The standardized event contract recommends XXH3-64 with seed S.
Local block hash: XXH3(token_bytes_le, S), where token_bytes_le is the concatenation of one block's tokens encoded as little-endian u32 values.
Rolling sequence hash: The first block uses seq_hash[0] = local_block_hash[0]. Each subsequent block uses:
seq_hash[i] = XXH3(seq_hash[i-1]_le || local_block_hash[i]_le, S)
Here || means byte concatenation, not a logical OR.
All hashes used by the standardized KV Events API are rolling sequence hashes.
A seq_hash identifies the whole prefix up to that block depth, so equal
prefixes produce equal hashes until the first differing block.
If an engine does not follow the standardized hashing scheme, it must provide
token_ids in stored events so the consumer can recompute the indexer’s hash
representation.
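The hashing scheme above can be sketched in Python as follows. This is a minimal illustration, not the Conductor implementation. It assumes that the _le suffix on hashes means an 8-byte little-endian encoding of the 64-bit value, that a trailing partial block is skipped, and that the seed S is 0 for the example. It uses the third-party xxhash package when it is installed; the fallback is a stand-in that is not XXH3 and only keeps the sketch runnable.

```python
import hashlib
import struct

try:
    import xxhash  # third-party package that provides XXH3-64

    def hash64(data: bytes, seed: int) -> int:
        return xxhash.xxh3_64_intdigest(data, seed=seed)
except ImportError:
    # Stand-in so the sketch runs without xxhash. This is NOT XXH3 and will
    # not match hashes produced by a standardized publisher.
    def hash64(data: bytes, seed: int) -> int:
        key = struct.pack("<Q", seed)
        digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
        return int.from_bytes(digest, "little")


def local_block_hash(tokens, seed):
    # Concatenate the block's tokens as little-endian u32 values.
    token_bytes_le = b"".join(struct.pack("<I", t) for t in tokens)
    return hash64(token_bytes_le, seed)


def rolling_seq_hashes(token_ids, block_size, seed=0):
    """Return one rolling sequence hash per complete block."""
    hashes = []
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        local = local_block_hash(block, seed)
        if i == 0:
            hashes.append(local)
        else:
            # Assumption: both 64-bit hashes are packed as little-endian u64.
            chained = struct.pack("<QQ", hashes[-1], local)
            hashes.append(hash64(chained, seed))
    return hashes
```

Two prompts that share their first block produce the same seq_hash[0] and diverge from the first differing block onward, which is the property the prefix index relies on.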
HTTP APIs#
POST /register#
Registers a KV event publisher and starts consuming events from it.
{
"endpoint": "tcp://1.1.1.1:5557",
"replay_endpoint": "tcp://1.1.1.1:5558",
"type": "vLLM",
"modelname": "deepseek",
"lora_name": "sql-adapter",
"tenant_id": "default",
"instance_id": "vllm-prefill-node1",
"block_size": 128,
"dp_rank": 0,
"additionalsalt": "w8a8"
}
| Field | Required | Description |
|---|---|---|
| endpoint | Yes | ZMQ KV event publisher endpoint. |
| replay_endpoint | No | ZMQ replay endpoint used to recover missed events. |
| type | Yes | Publisher type, such as vLLM. |
| modelname | Yes | Model name for this publisher. This is the HTTP API wire name for model_name. |
| lora_name | No | LoRA adapter name. Empty or omitted means base model. |
| tenant_id | No | Tenant identity. Defaults to default. |
| instance_id | Yes | Router-facing engine or API server instance identity. |
| block_size | Yes | KV block size in tokens. |
| dp_rank | Yes | Data-parallel rank for this publisher. |
| additionalsalt | No | HTTP API wire name for additional_salt. |
Successful response:
{
"status": "registered successfully",
"instance_id": "vllm-prefill-node1"
}
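A client could assemble the registration call along these lines. This is a sketch using only the Python standard library. The base URL, the helper name build_register_request, and the JSON Content-Type header are assumptions for illustration; only the field names come from the request body shown above.

```python
import json
import urllib.request


def build_register_request(base_url, *, endpoint, publisher_type, modelname,
                           instance_id, block_size, dp_rank,
                           replay_endpoint=None, lora_name=None,
                           tenant_id=None, additionalsalt=None):
    """Build (but do not send) a POST /register request."""
    payload = {
        "endpoint": endpoint,
        "type": publisher_type,
        "modelname": modelname,
        "instance_id": instance_id,
        "block_size": block_size,
        "dp_rank": dp_rank,
    }
    optional = {
        "replay_endpoint": replay_endpoint,
        "lora_name": lora_name,
        "tenant_id": tenant_id,
        "additionalsalt": additionalsalt,
    }
    # Omit optional fields that were not supplied so server defaults apply.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        base_url.rstrip("/") + "/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


# Hypothetical usage; the host and port are placeholders:
# req = build_register_request("http://conductor.example:8080", ...)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```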
POST /unregister#
Stops consuming events for a registered publisher.
{
"type": "vLLM",
"modelname": "deepseek",
"lora_name": "sql-adapter",
"tenant_id": "default",
"instance_id": "vllm-prefill-node1",
"block_size": 128,
"dp_rank": 0
}
tenant_id defaults to default. The current implementation removes the
subscription identified by (instance_id, tenant_id, dp_rank).
Successful response:
{
"status": "unregistered successfully",
"removed_instances": ["vllm-prefill-node1|default|0"]
}
POST /query#
Queries cache hits by token IDs.
{
"model": "deepseek",
"lora_name": "sql-adapter",
"token_ids": [101, 15, 100, 55, 89],
"tenant_id": "default",
"instance_id": "vllm-prefill-node1",
"block_size": 64,
"cache_salt": "w8a8"
}
| Field | Required | Description |
|---|---|---|
| model | Yes | Model name. |
| lora_name | No | LoRA adapter name. Empty or omitted means base model. |
|  | No | Deprecated compatibility field. Do not use together with |
| token_ids | Yes | Prompt token IDs. Only complete blocks are considered. |
| tenant_id | No | Tenant identity. Defaults to default. |
| instance_id | No | If set, query one instance. If omitted, query all instances registered under the tenant. |
| block_size | Yes | KV block size in tokens. |
| cache_salt | No | Query-side hash namespace salt. Corresponds to the event additional_salt field. |
Response:
{
"default": {
"vllm-prefill-node1": {
"longest_matched": 256,
"GPU": 128,
"DP": {
"0": 128,
"1": 256
},
"CPU": 256,
"DISK": 0
}
}
}
| Field | Description |
|---|---|
| longest_matched | Longest continuous prefix hit in tokens across all tracked media and DP ranks for this instance. |
| GPU, CPU, DISK | Matched prefix tokens available on each medium. Names are examples; future media can be added. |
| DP | Matched prefix tokens grouped by data-parallel rank. |
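A router could turn this response into a scheduling decision roughly as follows. The function name pick_instance and the tie-breaking rule (prefer the instance with more GPU-resident tokens) are assumptions for illustration; the document does not prescribe a selection policy.

```python
def pick_instance(query_response, tenant_id="default"):
    """Return (instance_id, longest_matched) for the best hit, or None."""
    instances = query_response.get(tenant_id, {})
    if not instances:
        return None

    def score(item):
        _, hits = item
        # Primary key: longest prefix hit. Tie-breaker (assumed policy):
        # tokens already resident on the GPU tier.
        return (hits.get("longest_matched", 0), hits.get("GPU", 0))

    best_id, best_hits = max(instances.items(), key=score)
    return best_id, best_hits.get("longest_matched", 0)
```

A production router would normally combine this score with load and queue depth; the sketch only shows how the response fields are read.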
POST /query_by_hash#
Queries cache hits by precomputed rolling sequence hashes. This API avoids sending long token lists over the network.
{
"model": "deepseek",
"lora_name": "sql-adapter",
"seq_hashes": [1234567890, 9876543210],
"tenant_id": "default",
"instance_id": "vllm-prefill-node1",
"block_size": 64,
"cache_salt": "w8a8"
}
For compatibility with earlier drafts, clients may send the hash list under the
field name block_hash, but new clients should use seq_hashes to make it explicit
that the values are rolling sequence hashes rather than local block hashes.
The response shape is the same as /query. For this API,
longest_matched = block_size * matched_hash_count.
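The formula can be illustrated with a small function. It assumes that matched_hash_count is the number of leading hashes in the request that the indexer knows, which is consistent with longest_matched being a continuous prefix hit. The plain set standing in for the index is hypothetical; the real index structure is not described here.

```python
def longest_matched_tokens(seq_hashes, indexed_hashes, block_size):
    """Count leading hashes present in the index and convert to tokens."""
    matched_hash_count = 0
    for h in seq_hashes:
        if h not in indexed_hashes:
            break  # the prefix ends at the first missing block
        matched_hash_count += 1
    return block_size * matched_hash_count
```

For example, with block_size 64 and the first two of three hashes present, the result is 128 tokens. A hash that is present after a missing one does not extend the match.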
KV Events API#
The KV Events API is the wire contract between cache owners and indexers. Conductor normalizes engine-specific events into this model.
Event envelope#
Every standardized event carries the same envelope:
{
"event_id": 42,
"timestamp": 1739145600000,
"event_type": "stored",
"model_name": "llama-3.1-8b",
"block_size": 64,
"additional_salt": null,
"lora_name": null,
"tenant_id": "default",
"backend_id": "worker-0",
"medium": "gpu",
"dp_rank": 0
}
| Field | Type | Description |
|---|---|---|
| event_id |  | Monotonically increasing sequence number scoped by the event stream dimensions. Authoritative for ordering. |
| timestamp |  | Unix epoch milliseconds. Informational only, not used for ordering. |
| event_type |  | One of stored, removed, or cleared. |
| model_name |  | Model identifier. |
| block_size |  | Tokens per block. |
| additional_salt |  | Opaque deployment salt or namespace. |
| lora_name |  | LoRA adapter name, or null for the base model. |
| tenant_id |  | Tenant or customer identity. |
| backend_id |  | Entity that owns the KV blocks. This can be an engine worker or a decoupled cache daemon. |
| medium |  | Cache medium such as gpu, cpu, or disk. |
| dp_rank |  | Data-parallel rank. |
Events must be processed in consecutive event_id order within each stream
identified by (model_name, block_size, additional_salt, lora_name, tenant_id, backend_id, medium, dp_rank).
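A consumer might enforce this ordering rule with a per-stream tracker like the one below. The class name, the return values, and the choice to accept the first event seen on a stream as its starting point are assumptions for illustration. How a gap is repaired, for example through the replay endpoint, is left to the caller.

```python
STREAM_FIELDS = (
    "model_name", "block_size", "additional_salt", "lora_name",
    "tenant_id", "backend_id", "medium", "dp_rank",
)


def stream_key(event):
    """Tuple of the dimensions that scope event_id ordering."""
    return tuple(event.get(field) for field in STREAM_FIELDS)


class SequenceTracker:
    def __init__(self):
        self._last_id = {}  # stream key -> last accepted event_id

    def check(self, event):
        """Return "ok", "stale", or "gap" for the event's stream."""
        key = stream_key(event)
        event_id = event["event_id"]
        last = self._last_id.get(key)
        if last is None or event_id == last + 1:
            self._last_id[key] = event_id
            return "ok"
        if event_id <= last:
            return "stale"  # already seen or older than the stream position
        return "gap"  # events are missing; position is not advanced
```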
stored#
Published when one or more consecutive blocks are committed to a KV cache.
{
"event_id": 42,
"timestamp": 1739145600000,
"event_type": "stored",
"model_name": "llama-3.1-8b",
"block_size": 64,
"additional_salt": null,
"lora_name": null,
"tenant_id": "default",
"backend_id": "worker-0",
"medium": "gpu",
"dp_rank": 0,
"seq_hashes": [1234567890, 9876543210, 1122334455],
"base_block_idx": 5,
"parent_hash": 9999999999,
"token_ids": null
}
| Field | Type | Description |
|---|---|---|
| seq_hashes |  | Rolling sequence hashes of consecutive stored blocks. |
| base_block_idx |  | Zero-based depth of the first block in this event. |
| parent_hash |  | Rolling sequence hash at depth base_block_idx - 1. |
| token_ids |  | Tokens across all blocks in this event. Required when the publisher does not use the standardized hash. |
At least one of base_block_idx or parent_hash must be present so the
consumer can locate the blocks in the sequence.
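The two payload rules stated for stored events can be checked with a small validator. The function name and the uses_standard_hash flag are hypothetical; the flag stands for whatever out-of-band knowledge the consumer has about the publisher's hashing scheme.

```python
def validate_stored(event, uses_standard_hash=True):
    """Return a list of rule violations for a stored event payload."""
    errors = []
    # Compare against None explicitly: base_block_idx == 0 is a valid depth.
    if event.get("base_block_idx") is None and event.get("parent_hash") is None:
        errors.append("need base_block_idx or parent_hash to locate the blocks")
    if not uses_standard_hash and not event.get("token_ids"):
        errors.append("token_ids is required without the standardized hash")
    return errors
```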
removed#
Published when one or more blocks are evicted.
{
"event_id": 43,
"timestamp": 1739145601000,
"event_type": "removed",
"model_name": "llama-3.1-8b",
"block_size": 64,
"additional_salt": null,
"lora_name": null,
"tenant_id": "default",
"backend_id": "worker-0",
"medium": "gpu",
"dp_rank": 0,
"seq_hashes": [1122334455],
"base_block_idx": 7
}
seq_hashes is required. base_block_idx is optional but recommended for
collision detection and observability.
cleared#
Published when all blocks for the event stream dimensions are purged.
{
"event_id": 44,
"timestamp": 1739145602000,
"event_type": "cleared",
"model_name": "llama-3.1-8b",
"block_size": 64,
"additional_salt": null,
"lora_name": null,
"tenant_id": "default",
"backend_id": "worker-0",
"medium": "gpu",
"dp_rank": 0
}
No additional payload fields are required.
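The combined effect of the three event types can be shown with a toy in-memory index. This is an illustration of the event semantics only. The ToyIndex class and its set-per-stream layout are assumptions, and the real Conductor prefix index is not described in this document.

```python
STREAM_FIELDS = (
    "model_name", "block_size", "additional_salt", "lora_name",
    "tenant_id", "backend_id", "medium", "dp_rank",
)


class ToyIndex:
    def __init__(self):
        self._hashes = {}  # stream key -> set of rolling sequence hashes

    @staticmethod
    def _key(event):
        return tuple(event.get(field) for field in STREAM_FIELDS)

    def apply(self, event):
        held = self._hashes.setdefault(self._key(event), set())
        kind = event["event_type"]
        if kind == "stored":
            held.update(event["seq_hashes"])
        elif kind == "removed":
            held.difference_update(event["seq_hashes"])
        elif kind == "cleared":
            held.clear()  # purges only this stream's blocks
        else:
            raise ValueError(f"unknown event_type: {kind!r}")

    def matched_blocks(self, envelope, seq_hashes):
        """Number of leading hashes held for the envelope's stream."""
        held = self._hashes.get(self._key(envelope), set())
        count = 0
        for h in seq_hashes:
            if h not in held:
                break
            count += 1
        return count
```

Because the index is keyed by the full set of stream dimensions, a cleared event for one medium or backend leaves the blocks of every other stream untouched.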
Compatibility notes#
The current Conductor implementation consumes vLLM ZMQ msgpack batches and
normalizes BlockStored and BlockRemoved into the internal prefix index.
Registration metadata supplies fields such as modelname, tenant_id,
instance_id, block_size, and additionalsalt when the engine event does
not carry the full standardized envelope.
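The way registration metadata fills in a partial engine event can be sketched as a dictionary merge. The function name, the rule that a value carried by the event takes precedence, and keeping instance_id under its own name are assumptions. The wire-name mapping (modelname to model_name, additionalsalt to additional_salt) follows the /register field descriptions.

```python
REGISTRATION_TO_ENVELOPE = {
    "modelname": "model_name",
    "additionalsalt": "additional_salt",
    "lora_name": "lora_name",
    "tenant_id": "tenant_id",
    "instance_id": "instance_id",
    "block_size": "block_size",
    "dp_rank": "dp_rank",
}


def fill_envelope(partial_event, registration):
    """Copy registration values into envelope fields the event lacks."""
    event = dict(partial_event)  # leave the caller's dict untouched
    for reg_field, env_field in REGISTRATION_TO_ENVELOPE.items():
        value = registration.get(reg_field)
        if event.get(env_field) is None and value is not None:
            event[env_field] = value
    if event.get("tenant_id") is None:
        event["tenant_id"] = "default"  # documented default tenant
    return event
```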