# Mooncake Transfer Engine Benchmark Tool (tebench) Guide
`tebench` is an end-to-end benchmarking tool for the Mooncake Transfer Engine. It measures bandwidth and latency across (block size, batch size, concurrency) combinations, and supports both the classic TE backend and the new TENT backend.
## 1. Build and Deployment

### 1.1 Prerequisites
- A C++ toolchain capable of building the project
- Optional: GPU(s) and the corresponding transport stacks (RDMA, shared memory, io_uring, etc.)

Exact dependencies (CUDA, OFED, io_uring, etc.) depend on the transport backend being used.
### 1.2 Build with TENT Enabled

A typical out-of-tree build:
```bash
mkdir -p build
cd build
cmake .. -DUSE_TENT=ON
cmake --build . -j
```
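For reference, a complete from-scratch sequence might look like the sketch below; the repository URL is an assumption (substitute your own checkout), and the `-DUSE_TENT=OFF` variant for a classic-only build is inferred rather than documented here.

```bash
# Clone the sources (repository URL assumed; substitute your own checkout)
git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake

# Out-of-tree build with the TENT backend enabled
mkdir -p build && cd build
cmake .. -DUSE_TENT=ON
cmake --build . -j

# Presumably a classic-only build would turn the flag off:
# cmake .. -DUSE_TENT=OFF
```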
## 2. Execution Model

`tebench` runs in one of two roles:
- **Target (receiver):** creates a memory segment and waits for connections
- **Initiator (sender):** connects to a target and executes benchmark cases
### Role Selection

- If `--target_seg_name` is empty → Target mode
- If `--target_seg_name` is set → Initiator mode
### Backend Selection

- `--backend=classic` → classic TE implementation
- `--backend=tent` → TENT implementation
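Putting the two rules together, the same binary plays either role depending only on its flags; a minimal sketch, where `<SEG_NAME>` stands for the segment name printed by the target:

```bash
# Target: --target_seg_name is empty, so the process creates a segment and waits
./tebench --seg_type=DRAM --backend=classic

# Initiator: --target_seg_name is set, so the process runs the benchmark cases
./tebench --target_seg_name=<SEG_NAME> --seg_type=DRAM --backend=classic
```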
## 3. Quick Start

### 3.1 Start Target (First)

On the target machine:
```bash
./tebench \
    --seg_type=DRAM \
    --backend=tent
```
The program will print a command containing a generated segment name:

```
To start initiators, run
./tebench --target_seg_name=<SEG_NAME> --seg_type=DRAM --backend=tent
Press Ctrl-C to terminate
```
Copy the printed `--target_seg_name` value to the initiator side.
### 3.2 Start Initiator

On the initiator machine:
```bash
./tebench \
    --target_seg_name=<SEG_NAME_FROM_TARGET> \
    --seg_type=DRAM \
    --backend=tent \
    --op_type=read \
    --duration=5
```
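For a quick smoke test, both roles can run on a single machine in two terminals, assuming the default `p2p` metadata service and a transport that supports local transfers (shared memory under TENT, for example):

```bash
# Terminal 1: target
./tebench --seg_type=DRAM --backend=tent

# Terminal 2: initiator, using the segment name printed in terminal 1
./tebench --target_seg_name=<SEG_NAME> --seg_type=DRAM --backend=tent \
    --op_type=read --duration=5
```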
## 4. Output Metrics
Each output row corresponds to one benchmark configuration.
| Column | Description |
|---|---|
| Block size | Block size per request (bytes) |
| Batch size | Number of requests per submission |
| Bandwidth | Throughput (total bytes / total time) |
| Average latency | Average end-to-end latency (scaled by thread count) |
| Average transfer time | Average per-transfer execution time |
| P99 | P99 transfer latency |
| P999 | P999 transfer latency |
A short (~1 second) warmup phase is executed before measurements begin.
## 5. Runtime Configuration
This section summarizes the key runtime options that control workload behavior, resource usage, and backend selection.
### 5.1 Workload Mode (`--op_type`)

Controls the transfer pattern executed by the initiator:

- `read` — repeated READ transfers
- `write` — repeated WRITE transfers
- `mix` — alternating WRITE followed by READ (both recorded)
Example:

```
--op_type=write
```
### 5.2 Data Consistency Check (`--check_consistency`)

When enabled, the benchmark validates correctness in `mix` mode:

- Source buffers are filled with a known pattern before WRITE
- Data is READ back and verified on the initiator
Example:

```bash
./tebench --target_seg_name=<SEG> --op_type=mix --check_consistency=true
```
Consistency checking introduces CPU-side overhead and should be disabled for pure performance measurements.
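A practical pattern is therefore a short validation pass with checking enabled, followed by a longer performance pass without it; a sketch, with `<SEG>` as a placeholder:

```bash
# Pass 1: short run that verifies data integrity in mix mode
./tebench --target_seg_name=<SEG> --op_type=mix --check_consistency=true --duration=2

# Pass 2: longer run without the CPU-side verification overhead
./tebench --target_seg_name=<SEG> --op_type=mix --duration=10
```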
### 5.3 Segment, Concurrency, and Memory Layout

**Segment and role**

- `--seg_name`: local segment name (typically left empty)
- `--seg_type`: `DRAM | VRAM` (default: `DRAM`)
- `--target_seg_name`: target segment name (empty → Target mode)

**Scan ranges** (a combined sweep example follows the skip rule below)

- `--total_buffer_size`: total buffer limit (bytes)
- `--start_block_size`, `--max_block_size`: block size sweep (powers of two)
- `--start_batch_size`, `--max_batch_size`: batch size sweep (powers of two)
- `--start_num_threads`, `--max_num_threads`: thread sweep (powers of two)
- `--duration`: measurement time per case (seconds)
A test case is skipped when:

```
block_size × batch_size × num_threads > total_buffer_size
```
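As a worked illustration (the values are illustrative, and sizes are assumed to be raw byte counts), the sweep below caps the buffer at 1 GiB. Its largest case is 1 MiB × 64 × 8 threads = 512 MiB, which fits under the cap, so every case runs:

```bash
# Block sizes 4 KiB..1 MiB, batches 1..64, threads 1..8, 1 GiB buffer cap
./tebench \
    --target_seg_name=<SEG> \
    --total_buffer_size=1073741824 \
    --start_block_size=4096 \
    --max_block_size=1048576 \
    --start_batch_size=1 \
    --max_batch_size=64 \
    --start_num_threads=1 \
    --max_num_threads=8 \
    --duration=5
# Raising --max_block_size to 4194304 (4 MiB) would make the largest case
# 4 MiB x 64 x 8 = 2 GiB > 1 GiB, so that case would be skipped.
```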
### 5.4 GPU Affinity

- `--local_gpu_id`: initiator base GPU ID
- `--target_gpu_id`: target base GPU ID

Each worker thread uses:

```
gpu_id + thread_id
```
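For example, a VRAM run with base GPU ID 0 and four worker threads maps thread `i` to GPU `0 + i` on each side, so threads 0..3 use GPUs 0..3. The flag values below are illustrative:

```bash
./tebench \
    --target_seg_name=<SEG> \
    --seg_type=VRAM \
    --local_gpu_id=0 \
    --target_gpu_id=0 \
    --start_num_threads=4 \
    --max_num_threads=4
```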
### 5.5 Backend, Transport, and Metadata

**Backend**

- `--backend`: `classic | tent` (default: `tent`)

**Transport (TENT only)**

- `--xport_type`: `rdma | shm | mnnvl | gds | iouring`

**Metadata service**

- `--metadata_type`: `p2p | etcd | redis | http` (default: `p2p`)
- `--metadata_url_list`: comma-separated URLs (ignored in `p2p` mode)
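For instance, pointing the benchmark at an external etcd cluster might look like the sketch below; the endpoints are placeholders, and the URL scheme is an assumption about how `--metadata_url_list` entries are written:

```bash
./tebench \
    --seg_type=DRAM \
    --backend=tent \
    --metadata_type=etcd \
    --metadata_url_list=http://10.0.0.1:2379,http://10.0.0.2:2379
```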