Welcome to Mooncake

Mooncake

A KVCache-centric Disaggregated Architecture for LLM Serving.

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Now both the Transfer Engine and Mooncake Store are open-sourced! This repository also hosts its technical report and the open-sourced traces.

🔄 Updates

  • Feb 12, 2026: We are thrilled to announce that Mooncake has officially joined the PyTorch Ecosystem!

  • Jan 28, 2026: FlexKV, a distributed KV store and cache system from Tencent and NVIDIA in collaboration with the community, now supports distributed KVCache reuse with the Mooncake Transfer Engine.

  • Dec 23, 2025: SGLang introduces Encode-Prefill-Decode (EPD) Disaggregation with Mooncake as a transfer backend. This integration allows decoupling compute-intensive multimodal encoders (e.g., Vision Transformers) from language model nodes, utilizing Mooncake’s RDMA engine for zero-copy transfer of large multimodal embeddings.

  • Dec 19, 2025: Mooncake Transfer Engine has been integrated into TensorRT LLM for KVCache transfer in PD-disaggregated inference.

  • Dec 19, 2025: Mooncake Transfer Engine has been directly integrated into vLLM v1 as a KV Connector in PD-disaggregated setups.

  • Nov 07, 2025: RBG + SGLang HiCache + Mooncake: a role-based, out-of-the-box solution for cloud-native deployment that is elastic, scalable, and high-performance.

  • Sept 18, 2025: Mooncake Store empowers vLLM Ascend by serving as the distributed KV cache pool backend.

  • Sept 10, 2025: SGLang officially supports Mooncake Store as a hierarchical KV caching storage backend. The integration extends RadixAttention with multi-tier KV cache storage across device, host, and remote storage layers.

  • Sept 10, 2025: The official, high-performance version of Mooncake P2P Store is open-sourced as checkpoint-engine. It has been successfully applied in K1.5 and K2 production training, updating the Kimi-K2 model (1T parameters) across thousands of GPUs in ~20s.

  • Aug 23, 2025: xLLM high-performance inference engine builds hybrid KV cache management based on Mooncake, supporting global KV cache management with intelligent offloading and prefetching.

  • Aug 18, 2025: vLLM-Ascend integrates Mooncake Transfer Engine for KV cache registration and disaggregated prefill, enabling efficient distributed inference on Ascend NPUs.

  • Jul 20, 2025: Mooncake powers the deployment of Kimi K2 on 128 H200 GPUs with PD disaggregation and large-scale expert parallelism, achieving 224k tokens/sec prefill throughput and 288k tokens/sec decode throughput.

  • Jun 20, 2025: Mooncake becomes a PD disaggregation backend for LMDeploy.

  • May 9, 2025: NIXL officially supports Mooncake Transfer Engine as a backend plugin.

  • May 8, 2025: Mooncake x LMCache unite to pioneer a KVCache-centric LLM serving system.

  • May 5, 2025: With support from the Mooncake team, SGLang released guidance on deploying DeepSeek with PD disaggregation on 96 H100 GPUs.

  • Apr 22, 2025: LMCache officially supports Mooncake Store as a remote connector.

  • Apr 10, 2025: SGLang officially supports Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.

  • Mar 7, 2025: We open-sourced Mooncake Store, a distributed KVCache store built on the Transfer Engine. vLLM’s xPyD disaggregated prefilling & decoding based on Mooncake Store will be released soon.

  • Feb 25, 2025: Mooncake receives the Best Paper Award at FAST 2025!

  • Feb 21, 2025: The updated traces used in our FAST’25 paper have been released.

  • Dec 16, 2024: vLLM officially supports Mooncake Transfer Engine for disaggregated prefilling and KV cache transfer.

  • Nov 28, 2024: We open-sourced the Transfer Engine, the central component of Mooncake. We also provide two demonstrations of the Transfer Engine: a P2P Store and a vLLM integration.

  • July 9, 2024: We open-sourced the trace as a JSONL file.

  • June 27, 2024: We present a series of Chinese blog posts with further discussion on Zhihu: 1, 2, 3, 4, 5, 6, 7.

  • June 26, 2024: Initial technical report release.

Documentation

Community