Mooncake Conductor Indexer API


Overview#

Mooncake Conductor is a KV cache indexer used by routers and gateways to make cache-aware scheduling decisions. It consumes KV cache events from inference engines or storage backends, maintains prefix-hit metadata across cache tiers, and exposes HTTP APIs for service registration and cache-hit queries.

This document incorporates the latest API direction from:

The Conductor can serve multiple model groups in one process. Each query is scoped by model identity, block size, LoRA identity, tenant isolation, and the registered instance that can receive traffic.

Concepts#

Storage tiers#

The indexer tracks KV cache availability across three logical tiers:

G1, Device Pool: Device-resident KV blocks, such as GPU, NPU, HBM, or other accelerator memory owned by inference engines.
G2, Host Pool: CPU or host DRAM KV blocks, including Mooncake registered memory pools.
G3, Disk Pool: SSD, 3FS, DFS, NFS, or other disk-backed KV storage.

The medium field identifies the concrete tier or device type. Common values are gpu, cpu, and disk. Engines may add other values as new media are supported.

Identity dimensions#

KV cache hits are interpreted under the following dimensions:

Dimension	Description
`model_name` or `model`	Model identifier. KV blocks from different models are incompatible.
`block_size`	Number of tokens per KV block. Different block sizes produce different token-to-block mappings.
`additional_salt`	Opaque salt used to separate hash namespaces for quantization, model revision, tenant isolation, or other deployment-specific dimensions.
`cache_salt`	Query-side hash namespace salt that keeps cached blocks separate for different customers. Corresponds to the event `additional_salt` concept.
`lora_name`	LoRA adapter name. Empty or `null` means the base model.
`tenant_id`	Upstream tenant or customer identity. Used for isolation and to keep query output bounded.
`instance_id`	Routable API server or engine instance returned by the Indexer API. Routers use this value as the scheduling target.
`backend_id`	KV Events identity for the entity that owns the KV blocks. It may be an inference worker, a Mooncake storage daemon, or another cache backend.
`medium`	Cache medium where the blocks are present.
`dp_rank`	Data-parallel rank that owns or can serve the blocks.

instance_id and backend_id intentionally have different meanings. instance_id is the router-facing target in the Indexer API. backend_id is the event-facing cache owner in the KV Events API. In deployments where cache storage is decoupled from inference workers, backend_id can identify a cache daemon while instance_id still identifies the engine endpoint that receives requests.

Hashing standard#

The standardized event contract recommends XXH3-64 with seed S.

Local block hash: XXH3(token_bytes_le, S), where tokens are little-endian u32 values concatenated for one block.
Rolling sequence hash: The first block uses seq_hash[0] = local_block_hash[0]. Each subsequent block uses:
```
seq_hash[i] = XXH3(seq_hash[i-1]_le || local_block_hash[i]_le, S)
```
Here || means byte concatenation, not a logical OR.

All hashes used by the standardized KV Events API are rolling sequence hashes. A seq_hash identifies the whole prefix up to that block depth, so equal prefixes produce equal hashes until the first differing block.
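The hashing rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Conductor's implementation: the function name `rolling_seq_hashes` is hypothetical, each u64 hash is assumed to be serialized as 8 little-endian bytes before concatenation, and the third-party `xxhash` package is used when it is installed. When it is not installed, a BLAKE2b stand-in keeps the sketch runnable, but that stand-in does not produce XXH3 values.

```python
import hashlib
import struct

try:
    import xxhash  # third-party package, optional for this sketch

    def hash64(data: bytes, seed: int) -> int:
        """XXH3-64 of data with the given seed."""
        return xxhash.xxh3_64_intdigest(data, seed=seed)
except ImportError:
    def hash64(data: bytes, seed: int) -> int:
        """Stand-in only: NOT XXH3. Used so the sketch runs without xxhash."""
        key = struct.pack("<Q", seed)
        digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
        return int.from_bytes(digest, "little")


def rolling_seq_hashes(token_ids, block_size, seed):
    """Return one rolling sequence hash per complete block of token_ids."""
    hashes = []
    full_blocks = len(token_ids) // block_size  # trailing partial block is ignored
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        # Tokens are little-endian u32 values concatenated for one block.
        token_bytes = struct.pack(f"<{block_size}I", *block)
        local = hash64(token_bytes, seed)
        if i == 0:
            hashes.append(local)  # seq_hash[0] = local_block_hash[0]
        else:
            # Assumption: both u64 values are encoded as 8 little-endian bytes.
            joined = struct.pack("<QQ", hashes[-1], local)
            hashes.append(hash64(joined, seed))
    return hashes
```

Two prompts that share their first block but differ in the second would yield the same first hash and different second hashes, which is the prefix property the event contract relies on.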

If an engine does not follow the standardized hashing scheme, it must provide token_ids in stored events so the consumer can recompute the indexer’s hash representation.

HTTP APIs#

`POST /register`#

Registers a KV event publisher and starts consuming events from it.

{
  "endpoint": "tcp://1.1.1.1:5557",
  "replay_endpoint": "tcp://1.1.1.1:5558",
  "type": "vLLM",
  "modelname": "deepseek",
  "lora_name": "sql-adapter",
  "tenant_id": "default",
  "instance_id": "vllm-prefill-node1",
  "block_size": 128,
  "dp_rank": 0,
  "additionalsalt": "w8a8"
}

Field	Required	Description
`endpoint`	Yes	ZMQ KV event publisher endpoint.
`replay_endpoint`	No	ZMQ replay endpoint used to recover missed events.
`type`	Yes	Publisher type, such as `vLLM`, `SGLang`, or `Mooncake`.
`modelname`	Yes	Model name for this publisher. This is the HTTP API wire name for `model_name`.
`lora_name`	No	LoRA adapter name. Empty or omitted means base model.
`tenant_id`	No	Tenant identity. Defaults to `default`.
`instance_id`	Yes	Router-facing engine or API server instance identity.
`block_size`	Yes	KV block size in tokens.
`dp_rank`	Yes	Data-parallel rank for this publisher.
`additionalsalt`	No	HTTP API wire name for `additional_salt`. Defaults to an empty string.

Successful response:

{
  "status": "registered successfully",
  "instance_id": "vllm-prefill-node1"
}
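A client could build the registration call roughly as follows. This is a minimal sketch using only the Python standard library. The helper name `build_register_request`, the base URL, and the `Content-Type: application/json` header are assumptions for illustration; the required field names come from the table above.

```python
import json
import urllib.request

# Fields marked "Yes" in the /register table.
REQUIRED_FIELDS = ("endpoint", "type", "modelname", "instance_id", "block_size", "dp_rank")


def build_register_request(base_url, payload):
    """Validate required fields and return a POST request for /register."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/register",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


# Hypothetical Conductor address; replace with the real one.
request = build_register_request(
    "http://conductor.example:8080",
    {
        "endpoint": "tcp://1.1.1.1:5557",
        "type": "vLLM",
        "modelname": "deepseek",
        "instance_id": "vllm-prefill-node1",
        "block_size": 128,
        "dp_rank": 0,
    },
)
# Sending is left to the caller, for example:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.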

`POST /unregister`#

Stops consuming events for a registered publisher.

{
  "type": "vLLM",
  "modelname": "deepseek",
  "lora_name": "sql-adapter",
  "tenant_id": "default",
  "instance_id": "vllm-prefill-node1",
  "block_size": 128,
  "dp_rank": 0
}

tenant_id defaults to default. The current implementation removes the subscription identified by (instance_id, tenant_id, dp_rank).

Successful response:

{
  "status": "unregistered successfully",
  "removed_instances": ["vllm-prefill-node1|default|0"]
}

`POST /query`#

Queries cache hits by token IDs.

{
  "model": "deepseek",
  "lora_name": "sql-adapter",
  "token_ids": [101, 15, 100, 55, 89],
  "tenant_id": "default",
  "instance_id": "vllm-prefill-node1",
  "block_size": 64,
  "cache_salt": "w8a8"
}

Field	Required	Description
`model`	Yes	Model name.
`lora_name`	No	LoRA adapter name. Empty or omitted means base model.
`lora_id`	No	Deprecated compatibility field. Do not use together with `lora_name`.
`token_ids`	Yes	Prompt token IDs. Only complete blocks are considered.
`tenant_id`	No	Tenant identity. Defaults to `default`.
`instance_id`	No	If set, query one instance. If omitted, query all instances registered under the tenant.
`block_size`	Yes	KV block size in tokens.
`cache_salt`	No	Query-side hash namespace salt. Corresponds to the event `additional_salt` concept.

Response:

{
  "default": {
    "vllm-prefill-node1": {
      "longest_matched": 256,
      "GPU": 128,
      "DP": {
        "0": 128,
        "1": 256
      },
      "CPU": 256,
      "DISK": 0
    }
  }
}

Field	Description
`longest_matched`	Longest continuous prefix hit in tokens across all tracked media and DP ranks for this instance.
`GPU`, `CPU`, `DISK`	Matched prefix tokens available on each medium. Names are examples; future media can be added.
`DP`	Matched prefix tokens grouped by data-parallel rank.
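A router might consume this response as in the sketch below. The selection policy (pick the instance with the largest `longest_matched`, then the DP rank with the most matched tokens) is an assumption for illustration, and the helper names `best_instance` and `best_dp_rank` are hypothetical. The document does not prescribe how routers should rank candidates.

```python
def best_instance(response, tenant_id="default"):
    """Return (instance_id, longest_matched) for the best hit, or (None, 0)."""
    best_id, best_hit = None, -1
    for instance_id, hits in response.get(tenant_id, {}).items():
        hit = hits.get("longest_matched", 0)
        if hit > best_hit:
            best_id, best_hit = instance_id, hit
    return best_id, max(best_hit, 0)


def best_dp_rank(hits):
    """Return the DP rank (as int) with the most matched tokens, or None."""
    per_rank = hits.get("DP", {})
    if not per_rank:
        return None
    rank, _ = max(per_rank.items(), key=lambda item: item[1])
    return int(rank)  # JSON object keys arrive as strings


# The example response shown above.
response = {
    "default": {
        "vllm-prefill-node1": {
            "longest_matched": 256,
            "GPU": 128,
            "DP": {"0": 128, "1": 256},
            "CPU": 256,
            "DISK": 0,
        }
    }
}

instance_id, matched = best_instance(response)
rank = best_dp_rank(response["default"][instance_id])
```

A production router would likely combine this score with load and queue depth rather than use the cache hit alone.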

`POST /query_by_hash`#

Queries cache hits by precomputed rolling sequence hashes. This API avoids sending long token lists over the network.

{
  "model": "deepseek",
  "lora_name": "sql-adapter",
  "seq_hashes": [1234567890, 9876543210],
  "tenant_id": "default",
  "instance_id": "vllm-prefill-node1",
  "block_size": 64,
  "cache_salt": "w8a8"
}

For compatibility with earlier drafts, the hash list may also be sent under the field name block_hash. New clients should use seq_hashes to make it explicit that the values are rolling sequence hashes rather than local block hashes.

The response shape is the same as /query. For this API, longest_matched = block_size * matched_hash_count.
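The relation between matched hashes and `longest_matched` can be illustrated as follows. Because each rolling hash identifies a whole prefix, the count stops at the first hash that is not cached. The function name `longest_matched_tokens` and the use of a plain set as the cache view are assumptions; the Conductor's actual lookup structure is not described here.

```python
def longest_matched_tokens(seq_hashes, cached_hashes, block_size):
    """Count leading hashes present in the cache and convert to tokens."""
    matched_hash_count = 0
    for seq_hash in seq_hashes:
        if seq_hash not in cached_hashes:
            break  # a miss ends the continuous prefix
        matched_hash_count += 1
    return block_size * matched_hash_count


# Two of three leading hashes are cached, so 2 * 64 tokens match.
tokens = longest_matched_tokens([10, 20, 30], {10, 20, 99}, block_size=64)
```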

KV Events API#

The KV Events API is the wire contract between cache owners and indexers. Conductor normalizes engine-specific events into this model.

Event envelope#

Every standardized event carries the same envelope:

{
  "event_id": 42,
  "timestamp": 1739145600000,
  "event_type": "stored",
  "model_name": "llama-3.1-8b",
  "block_size": 64,
  "additional_salt": null,
  "lora_name": null,
  "tenant_id": "default",
  "backend_id": "worker-0",
  "medium": "gpu",
  "dp_rank": 0
}

Field	Type	Description
`event_id`	`u64`	Monotonically increasing sequence number scoped by the event stream dimensions. Authoritative for ordering.
`timestamp`	`u64 or null`	Unix epoch milliseconds. Informational only, not used for ordering.
`event_type`	`string`	One of `stored`, `removed`, or `cleared`.
`model_name`	`string or null`	Model identifier.
`block_size`	`u32 or null`	Tokens per block.
`additional_salt`	`string or null`	Opaque deployment salt or namespace.
`lora_name`	`string or null`	LoRA adapter name, or `null` for the base model.
`tenant_id`	`string`	Tenant or customer identity.
`backend_id`	`string`	Entity that owns the KV blocks. This can be an engine worker or a decoupled cache daemon.
`medium`	`string or null`	Cache medium such as `gpu`, `cpu`, or `disk`.
`dp_rank`	`u32 or null`	Data-parallel rank.

Events must be processed in consecutive event_id order within each stream identified by (model_name, block_size, additional_salt, lora_name, tenant_id, backend_id, medium, dp_rank).
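A consumer could enforce this ordering rule with a small per-stream tracker, sketched below. The class name `OrderTracker` and the `"ok"` / `"duplicate"` / `"gap"` return values are hypothetical. Accepting whatever `event_id` arrives first on a stream is also an assumption, since the document does not state the starting value.

```python
# The eight dimensions that identify one event stream.
STREAM_FIELDS = (
    "model_name", "block_size", "additional_salt", "lora_name",
    "tenant_id", "backend_id", "medium", "dp_rank",
)


def stream_key(event):
    """Build the tuple that scopes event_id ordering."""
    return tuple(event.get(field) for field in STREAM_FIELDS)


class OrderTracker:
    """Tracks the last accepted event_id per stream."""

    def __init__(self):
        self._last_seen = {}

    def check(self, event):
        key = stream_key(event)
        event_id = event["event_id"]
        last = self._last_seen.get(key)
        if last is None or event_id == last + 1:
            self._last_seen[key] = event_id
            return "ok"
        if event_id <= last:
            return "duplicate"  # already processed or replayed
        return "gap"  # one or more events were missed
```

How to react to a gap is a deployment decision. One plausible option, where a `replay_endpoint` was registered, is to request the missed events before applying newer ones.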

`stored`#

Published when one or more consecutive blocks are committed to a KV cache.

{
  "event_id": 42,
  "timestamp": 1739145600000,
  "event_type": "stored",
  "model_name": "llama-3.1-8b",
  "block_size": 64,
  "additional_salt": null,
  "lora_name": null,
  "tenant_id": "default",
  "backend_id": "worker-0",
  "medium": "gpu",
  "dp_rank": 0,
  "seq_hashes": [1234567890, 9876543210, 1122334455],
  "base_block_idx": 5,
  "parent_hash": 9999999999,
  "token_ids": null
}

Field	Type	Description
`seq_hashes`	`u64[]`	Rolling sequence hashes of consecutive stored blocks.
`base_block_idx`	`u32 or null`	Zero-based depth of the first block in this event.
`parent_hash`	`u64 or null`	Rolling sequence hash at depth `base_block_idx - 1`; `null` at the root.
`token_ids`	`u32[] or null`	Tokens across all blocks in this event. Required when the publisher does not use the standardized hash.

At least one of base_block_idx or parent_hash must be present so the consumer can locate the blocks in the sequence.
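The constraints on `stored` events could be checked before indexing, for example as below. The function name `validate_stored` and the `standardized_hash` flag are hypothetical. How a consumer learns whether a publisher follows the standardized hash is not specified in this document, so the flag is passed in by the caller.

```python
def validate_stored(event, standardized_hash=True):
    """Return a list of problems found in a stored event (empty if none)."""
    errors = []
    if event.get("event_type") != "stored":
        errors.append("event_type must be 'stored'")
    if not event.get("seq_hashes"):
        errors.append("seq_hashes must contain at least one hash")
    # The consumer needs a way to place the blocks in the sequence.
    if event.get("base_block_idx") is None and event.get("parent_hash") is None:
        errors.append("one of base_block_idx or parent_hash is required")
    # Non-standard hashing requires tokens so hashes can be recomputed.
    if not standardized_hash and not event.get("token_ids"):
        errors.append("token_ids is required for non-standard hashing")
    return errors
```

Returning a list instead of raising on the first problem lets the caller log every issue with a rejected event at once.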

`removed`#

Published when one or more blocks are evicted.

{
  "event_id": 43,
  "timestamp": 1739145601000,
  "event_type": "removed",
  "model_name": "llama-3.1-8b",
  "block_size": 64,
  "additional_salt": null,
  "lora_name": null,
  "tenant_id": "default",
  "backend_id": "worker-0",
  "medium": "gpu",
  "dp_rank": 0,
  "seq_hashes": [1122334455],
  "base_block_idx": 7
}

seq_hashes is required. base_block_idx is optional but recommended for collision detection and observability.

`cleared`#

Published when all blocks for the event stream dimensions are purged.

{
  "event_id": 44,
  "timestamp": 1739145602000,
  "event_type": "cleared",
  "model_name": "llama-3.1-8b",
  "block_size": 64,
  "additional_salt": null,
  "lora_name": null,
  "tenant_id": "default",
  "backend_id": "worker-0",
  "medium": "gpu",
  "dp_rank": 0
}

No additional payload fields are required.
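Taken together, the three event types describe simple set operations on a per-stream block index. The sketch below keeps one set of rolling hashes per stream. The class name `BlockIndex` is hypothetical, and a flat set is a simplification chosen for clarity; it is not the Conductor's internal prefix index.

```python
from collections import defaultdict

STREAM_FIELDS = (
    "model_name", "block_size", "additional_salt", "lora_name",
    "tenant_id", "backend_id", "medium", "dp_rank",
)


class BlockIndex:
    """Per-stream set of rolling sequence hashes known to be cached."""

    def __init__(self):
        self._blocks = defaultdict(set)

    @staticmethod
    def key_of(event):
        return tuple(event.get(field) for field in STREAM_FIELDS)

    def apply(self, event):
        key = self.key_of(event)
        kind = event["event_type"]
        if kind == "stored":
            self._blocks[key].update(event["seq_hashes"])
        elif kind == "removed":
            self._blocks[key].difference_update(event["seq_hashes"])
        elif kind == "cleared":
            self._blocks.pop(key, None)  # purge everything for this stream
        else:
            raise ValueError(f"unknown event_type: {kind!r}")

    def cached(self, event):
        """Return the hashes currently held for the event's stream."""
        return set(self._blocks.get(self.key_of(event), ()))
```

Applying the `stored`, `removed`, and `cleared` examples from this section in order would leave the stream with three hashes, then two, then none.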

Compatibility notes#

The current Conductor implementation consumes vLLM ZMQ msgpack batches and normalizes BlockStored and BlockRemoved into the internal prefix index. Registration metadata supplies fields such as modelname, tenant_id, instance_id, block_size, and additionalsalt when the engine event does not carry the full standardized envelope.
