Allocator Performance#
Mooncake Store uses two allocators at different layers:
MmapArena — backs the mmap buffer path used by SGLang HiCache for host KV cache memory. When explicitly enabled, it pre-allocates a hugepage-backed pool at startup and serves allocations via lock-free atomic bump pointer (~50ns per allocation). Configured via
MC_MMAP_ARENA_POOL_SIZE(explicit opt-in) andMC_DISABLE_MMAP_ARENA.OffsetAllocator — manages object-level allocation within Mooncake Store segments for KV cache entries. Evaluated below.
These are complementary: MmapArena handles the backing memory, OffsetAllocator manages logical object placement within that memory.
OffsetAllocator#
We evaluated the performance of OffsetAllocator, the default memory allocator in Mooncake Store. This allocator is responsible for allocating memory from mounted segments to store the KV cache.
In this context, the most important metric is memory utilization, defined as the ratio between the amount of memory that can be successfully allocated and the total available memory. A higher utilization means that more KV tensors can be cached, thereby accelerating LLM tasks. However, due to memory fragmentation, allocation may fail even when the allocated memory is well below the total available capacity.
For the same allocator, memory utilization can vary significantly under different workloads. Therefore, we evaluated the allocator’s efficiency across a range of workloads.
In particular, in the LLM inference scenario, once the model is fixed, the size of each KV vector is also fixed. This means memory utilization under uniform allocation sizes becomes especially important. However, the original OffsetAllocator has a limitation: when the allocation size is uniform but does not match any of OffsetAllocator’s predefined bin sizes, memory utilization can be suboptimal.
To address this, we introduced targeted optimizations for uniform-size workloads on top of the original OffsetAllocator. As shown in our test results, the optimized version achieves significant performance improvements in such scenarios.
Execution#
./mooncake-store/benchmarks/allocator_bench
Result#
alloc size: The size of each object
utilization ratio: The total allocated size / total space
time: time in nanoseconds for each object allocation
OffsetAllocator optimization: round up the allocated size to a bin size.
Uniform size, size equals power of 2#
OffsetAllocator (After Optimization)
Alloc Size |
Min Util Ratio |
Avg Util Ratio |
Time (ns) |
|---|---|---|---|
32 |
1 |
1 |
544 |
128 |
1 |
1 |
417 |
512 |
1 |
1 |
174 |
2048 |
1 |
1 |
406 |
8192 |
1 |
1 |
180 |
32768 |
1 |
1 |
133 |
131072 |
1 |
1 |
109 |
524288 |
1 |
1 |
100 |
2097152 |
1 |
1 |
99 |
8388608 |
1 |
1 |
99 |
33554432 |
1 |
1 |
98 |
OffsetAllocator (Before Optimization)
Alloc Size |
Min Util Ratio |
Avg Util Ratio |
Time (ns) |
|---|---|---|---|
32 |
1 |
1 |
539 |
128 |
1 |
1 |
419 |
512 |
1 |
1 |
217 |
2048 |
1 |
1 |
408 |
8192 |
1 |
1 |
175 |
32768 |
1 |
1 |
130 |
131072 |
1 |
1 |
107 |
524288 |
1 |
1 |
99 |
2097152 |
1 |
1 |
100 |
8388608 |
1 |
1 |
98 |
33554432 |
1 |
1 |
98 |
Uniform size, size equals power of 2 +/- 17#
OffsetAllocator (After Optimization)
Alloc Size |
Min Util Ratio |
Avg Util Ratio |
Time (ns) |
|---|---|---|---|
15 |
1 |
1 |
568 |
111 |
0.991071 |
0.991071 |
441 |
495 |
0.966797 |
0.966797 |
178 |
2031 |
0.991699 |
0.991699 |
418 |
8175 |
0.997925 |
0.997925 |
170 |
32751 |
0.999481 |
0.999481 |
133 |
131055 |
0.99987 |
0.99987 |
109 |
524271 |
0.999968 |
0.999968 |
100 |
2097135 |
0.999992 |
0.999992 |
99 |
8388591 |
0.999998 |
0.999998 |
98 |
33554415 |
0.999999 |
0.999999 |
99 |
49 |
0.942308 |
0.942308 |
508 |
145 |
0.906249 |
0.906249 |
372 |
529 |
0.918399 |
0.918399 |
172 |
2065 |
0.896267 |
0.896267 |
403 |
8209 |
0.89073 |
0.89073 |
174 |
32785 |
0.889347 |
0.889347 |
131 |
131089 |
0.88897 |
0.88897 |
105 |
524305 |
0.888701 |
0.888701 |
102 |
2097169 |
0.888679 |
0.888679 |
100 |
8388625 |
0.886721 |
0.886721 |
100 |
33554449 |
0.875 |
0.875 |
100 |
OffsetAllocator (Before Optimization)
Alloc Size |
Min Util Ratio |
Avg Util Ratio |
Time (ns) |
|---|---|---|---|
15 |
1 |
1 |
566 |
111 |
0.669866 |
0.710845 |
703 |
495 |
0.665779 |
0.676874 |
238 |
2031 |
0.668333 |
0.705411 |
637 |
8175 |
0.666175 |
0.676474 |
242 |
32751 |
0.664435 |
0.669078 |
168 |
131055 |
0.66062 |
0.667341 |
124 |
524271 |
0.653055 |
0.666993 |
118 |
2097135 |
0.64062 |
0.666873 |
116 |
8388591 |
0.605468 |
0.667812 |
115 |
33554415 |
0.5625 |
0.670944 |
116 |
49 |
0.692229 |
0.753062 |
1122 |
145 |
0.667789 |
0.700907 |
572 |
529 |
0.66577 |
0.676238 |
238 |
2065 |
0.667926 |
0.704884 |
632 |
8209 |
0.665708 |
0.676372 |
239 |
32785 |
0.664224 |
0.669058 |
168 |
131089 |
0.659631 |
0.667287 |
129 |
524305 |
0.652609 |
0.666884 |
122 |
2097169 |
0.638677 |
0.666516 |
120 |
8388625 |
0.60547 |
0.665131 |
121 |
33554449 |
0.546875 |
0.660917 |
120 |
Uniform size, size equals power of 2 multiply 0.9 or 1.1#
OffsetAllocator (After Optimization)
Alloc Size |
Min Util Ratio |
Avg Util Ratio |
Time (ns) |
|---|---|---|---|
28 |
1 |
1 |
543 |
115 |
0.958333 |
0.958333 |
418 |
460 |
0.958332 |
0.958332 |
189 |
1843 |
0.959896 |
0.959896 |
418 |
7372 |
0.959895 |
0.959895 |
197 |
29491 |
0.959993 |
0.959993 |
135 |
117964 |
0.959979 |
0.959979 |
111 |
471859 |
0.959985 |
0.959985 |
100 |
1887436 |
0.959765 |
0.959765 |
99 |
7549747 |
0.959766 |
0.959766 |
99 |
30198988 |
0.95625 |
0.95625 |
99 |
35 |
0.972222 |
0.972222 |
531 |
140 |
0.972222 |
0.972222 |
397 |
563 |
0.977427 |
0.977427 |
180 |
2252 |
0.97743 |
0.97743 |
389 |
9011 |
0.977752 |
0.977752 |
183 |
36044 |
0.977752 |
0.977752 |
133 |
144179 |
0.977739 |
0.977739 |
106 |
576716 |
0.977538 |
0.977538 |
103 |
2306867 |
0.977539 |
0.977539 |
99 |
9227468 |
0.975391 |
0.975391 |
99 |
36909875 |
0.9625 |
0.9625 |
100 |
OffsetAllocator (Before Optimization)
Alloc Size |
Min Util Ratio |
Avg Util Ratio |
Time (ns) |
|---|---|---|---|
28 |
1 |
1 |
539 |
115 |
0.669299 |
0.709245 |
701 |
460 |
0.665825 |
0.677532 |
255 |
1843 |
0.669352 |
0.709202 |
691 |
7372 |
0.66619 |
0.677401 |
260 |
29491 |
0.664311 |
0.669511 |
172 |
117964 |
0.661812 |
0.667356 |
133 |
471859 |
0.654345 |
0.667048 |
123 |
1887436 |
0.640722 |
0.666447 |
121 |
7549747 |
0.611719 |
0.666847 |
119 |
30198988 |
0.548437 |
0.669799 |
125 |
35 |
0.7098 |
0.774162 |
1306 |
140 |
0.667934 |
0.702151 |
599 |
563 |
0.665599 |
0.675548 |
239 |
2252 |
0.667371 |
0.701623 |
601 |
9011 |
0.665485 |
0.675528 |
244 |
36044 |
0.663248 |
0.668912 |
170 |
144179 |
0.660308 |
0.666934 |
127 |
576716 |
0.654467 |
0.66679 |
122 |
2306867 |
0.633789 |
0.666159 |
121 |
9227468 |
0.597266 |
0.666037 |
118 |
36909875 |
0.55 |
0.669564 |
121 |
Random Size#
OffsetAllocator (After Optimization)
util ratio (min / p99 / p90 / p50 / max / avg):
0.544250 / 0.713338 / 0.779739 / 0.847867 / 0.952591 / 0.841576
avg alloc time: 145.575738 ns/op
OffsetAllocator (Before Optimization)
util ratio (min / p99 / p90 / p50 / max / avg):
0.569255 / 0.712076 / 0.781224 / 0.855046 / 0.976057 / 0.848873
avg alloc time: 142.508508 ns/op
OffsetAllocator (Before Optimization)
util ratio (min / p99 / p90 / p50 / max / avg):
0.569255 / 0.712076 / 0.781224 / 0.855046 / 0.976057 / 0.848873
avg alloc time: 142.508508 ns/op
Paired KV/Indexer Allocation Benchmark (DSA)#
In the DSA scenario, Mooncake Store stores both KV cache objects and indexer objects. We evaluated OffsetAllocator under a paired allocation workload using object sizes derived from a GLM-5.1-FP8 DSA serving configuration.
In this configuration, Mooncake Store stores data at the page granularity, where each page contains 64 tokens. Therefore, the object sizes observed by the allocator are the packed page-level sizes rather than the per-token sizes:
KV cache object size: 3.12 MB
Indexer object size: 643 KB
Each benchmark round samples N uniformly from [1, 128], then issues N KV cache allocations followed by N indexer allocations. The pool size is 1024 GB. On allocation failure, the benchmark randomly evicts 5% of all live objects and retries. The utilization ratio below is sampled only when eviction is triggered, immediately before evicting objects.
OffsetAllocator
util ratio (min / p99 / p90 / p50 / max / avg):
0.951814 / 0.951879 / 0.952077 / 0.952382 / 0.952621 / 0.952343
eviction-trigger samples: 243
avg alloc time: 352.207986 ns/op