AWS EFA Transport for Mooncake#
This document describes how to build and use Mooncake with AWS Elastic Fabric Adapter (EFA) support using libfabric.
Prerequisites#
1. AWS EFA Driver and libfabric#
EFA driver and libfabric should be pre-installed on AWS instances with EFA support (e.g., p6-b200.48xlarge, p5e.48xlarge, p4d.24xlarge).
Verify installation:
# Check EFA devices
fi_info -p efa
# Verify libfabric location
ls /opt/amazon/efa/lib/libfabric.so
ls /opt/amazon/efa/include/rdma/fabric.h
If not installed, follow AWS EFA documentation.
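The same checks can be scripted; the sketch below is illustrative only (it assumes the standard /opt/amazon/efa install paths shown above and that fi_info is either on PATH or under /opt/amazon/efa/bin):

```python
import os
import shutil
import subprocess

def efa_available() -> bool:
    """Best-effort check that the EFA libfabric provider is installed."""
    # fi_info ships with the AWS EFA installer; if the efa provider is listed,
    # at least one EFA device is visible to libfabric.
    fi_info = shutil.which("fi_info") or "/opt/amazon/efa/bin/fi_info"
    try:
        out = subprocess.run([fi_info, "-p", "efa"],
                             capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return False
    # Library and header locations installed by the AWS EFA driver package.
    paths = ["/opt/amazon/efa/lib/libfabric.so",
             "/opt/amazon/efa/include/rdma/fabric.h"]
    return "provider: efa" in out.stdout and all(os.path.exists(p) for p in paths)

if __name__ == "__main__":
    print("EFA ready" if efa_available() else "EFA not detected")
```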
2. Build Dependencies#
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y \
build-essential \
cmake \
git \
libgflags-dev \
libgoogle-glog-dev \
libjsoncpp-dev \
libnuma-dev \
libibverbs-dev \
libboost-all-dev \
libcurl4-openssl-dev \
libyaml-cpp-dev \
libgtest-dev \
pybind11-dev \
python3-dev
# Install yalantinglibs (required)
cd /tmp
git clone https://github.com/alibaba/yalantinglibs.git
cd yalantinglibs
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local
make -j$(nproc)
sudo make install
Building Mooncake with EFA Support#
1. Clone the Repository#
git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive
2. Build with EFA Enabled#
mkdir build && cd build
cmake .. \
-DUSE_EFA=ON \
-DUSE_CUDA=ON \
-DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j$(nproc)
Note:
-DUSE_CUDA=ON is required when transferring GPU memory (e.g., the KV cache in vLLM). Without it, the TCP transport (used as a fallback when mooncake_protocol is set to "tcp") cannot detect GPU memory and will fail with “Bad address” (EFAULT) errors.
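As a quick sanity check that both flags took effect, you can inspect the shared-library dependencies of the built extension. The sketch below is illustrative only; it assumes it is run from the repository root and that libfabric and the CUDA runtime are linked dynamically, which may differ in your build:

```python
import glob
import subprocess

# Path of the freshly built extension inside the build tree (it is copied into
# the wheel directory in the next step); adjust if your layout differs.
matches = glob.glob("build/mooncake-integration/engine.cpython-*.so")
if not matches:
    raise SystemExit("engine extension not found - did the build succeed?")

deps = subprocess.run(["ldd", matches[0]], capture_output=True, text=True).stdout
print("links libfabric:", "libfabric" in deps)       # expected with -DUSE_EFA=ON
print("links CUDA runtime:", "libcudart" in deps)    # expected with -DUSE_CUDA=ON
```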
3. Install Python Package#
# Copy built modules to wheel directory
cp mooncake-integration/engine.cpython-*.so ../mooncake-wheel/mooncake/
cp mooncake-asio/libasio.so ../mooncake-wheel/mooncake/
# Install with pip
pip install -e ../mooncake-wheel --no-build-isolation
Verification#
Test EFA transport initialization:
from mooncake.engine import TransferEngine
te = TransferEngine()
result = te.initialize('127.0.0.1', 'P2PHANDSHAKE', 'efa', '')
print(f'Initialize result: {result}') # Should be 0
# You should see logs like:
# EFA device (libfabric): rdmap79s0, domain: rdmap79s0-rdm, provider: efa
Usage with vLLM#
Prefill Instance#
VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
vllm serve <model_path> -tp 8 \
--port 8010 \
--trust-remote-code \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_connector_extra_config":{"mooncake_protocol":"efa"}}'
Decode Instance#
vllm serve <model_path> -tp 8 \
--port 8020 \
--trust-remote-code \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"mooncake_protocol":"efa"}}'
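Routing requests between the prefill and decode instances is handled by a separate proxy and is out of scope here, but once serving is up you can exercise the standard OpenAI-compatible API that vllm serve exposes. A minimal sketch, assuming the endpoint you query (proxy or, for a smoke test, a single instance) is reachable on port 8020 and that the model name matches <model_path>:

```python
import json
import urllib.request

# Point this at your proxy, or directly at one instance for a smoke test.
url = "http://localhost:8020/v1/completions"
payload = {
    "model": "<model_path>",  # must match the model passed to vllm serve
    "prompt": "Explain AWS EFA in one sentence.",
    "max_tokens": 64,
}
req = urllib.request.Request(url,
                             data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["text"])
```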
Unit Tests#
Run the EFA transport unit tests (requires EFA hardware):
./build/mooncake-transfer-engine/tests/efa_transport_test
The test suite includes:
Verify EFA transport installation
Loopback write operation
Write then read with data integrity check
Batch write (16 requests)
Stress test (20 batches x 8 requests)
You can also run all unit tests via CTest:
cd build && ctest --output-on-failure
Environment variables for test configuration:
export MC_METADATA_SERVER=P2PHANDSHAKE # default
export MC_LOCAL_SERVER_NAME=127.0.0.1:12345 # default
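If you drive the test from a script instead of exporting variables in the shell, the same configuration can be passed per process; a minimal sketch using the test binary path from above:

```python
import os
import subprocess

# Same defaults as the exports above, set only for the child process.
env = dict(os.environ,
           MC_METADATA_SERVER="P2PHANDSHAKE",
           MC_LOCAL_SERVER_NAME="127.0.0.1:12345")
subprocess.run(["./build/mooncake-transfer-engine/tests/efa_transport_test"],
               env=env, check=True)
```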
Performance Benchmark#
Use transfer_engine_bench to measure EFA transport throughput between two nodes.
Target Node (receiver)#
./build/mooncake-transfer-engine/example/transfer_engine_bench \
--mode=target \
--protocol=efa \
--metadata_server=P2PHANDSHAKE
Initiator Node (sender)#
./build/mooncake-transfer-engine/example/transfer_engine_bench \
--mode=initiator \
--protocol=efa \
--metadata_server=P2PHANDSHAKE \
--segment_id=<target_hostname>:<target_port> \
--operation=write \
--duration=10 \
--threads=8 \
--block_size=65536 \
--batch_size=128 \
--buffer_size=1073741824 \
--report_unit=GB
Replace <target_hostname>:<target_port> with the target node’s address shown in the target’s startup log (e.g., ip-172-31-29-226:12345).
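To sweep several parameter combinations (as in the tuning results below), it can be convenient to script the initiator side. The following sketch is illustrative; it reuses the flags from the command above and passes MC_SLICE_SIZE through the environment as described in the tuning section:

```python
import itertools
import os
import subprocess

BENCH = "./build/mooncake-transfer-engine/example/transfer_engine_bench"
TARGET = "<target_hostname>:<target_port>"  # as printed in the target's startup log

# Sweep a few of the knobs discussed in the tuning sections below.
for block_size, threads, slice_size in itertools.product(
        [65536, 131072], [8, 32, 48], [None, 262144]):
    env = dict(os.environ)
    if slice_size is not None:
        env["MC_SLICE_SIZE"] = str(slice_size)
    cmd = [BENCH,
           "--mode=initiator",
           "--protocol=efa",
           "--metadata_server=P2PHANDSHAKE",
           f"--segment_id={TARGET}",
           "--operation=write",
           "--duration=10",
           f"--threads={threads}",
           f"--block_size={block_size}",
           "--batch_size=128",
           "--buffer_size=1073741824",
           "--report_unit=GB"]
    print("running:", " ".join(cmd), "MC_SLICE_SIZE =", slice_size)
    subprocess.run(cmd, env=env, check=True)
```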
Key Parameters#
| Parameter | Default | Description |
|---|---|---|
| --block_size | 65536 | Bytes per transfer request |
| --batch_size | 128 | Requests per batch |
| --threads | 12 | Concurrent submission threads |
| --buffer_size | 1 GB | Total buffer size |
| --duration | 10 | Test duration in seconds |
| --operation | read | Transfer operation (read or write) |
| --report_unit | GB | Unit used for reported throughput |
Benchmark Results#
Tested on two p6-b200.48xlarge instances (8 EFA devices each, 8×400 Gbps) in the same AWS placement group.
Optimized Results#
With tuned parameters (MC_SLICE_SIZE=262144):
| Operation | Throughput | Configuration |
|---|---|---|
| Write | 167.63 GB/s | threads=48, block_size=128KB, batch_size=128, MC_SLICE_SIZE=256KB |
| Read | 171.89 GB/s | threads=48, block_size=128KB, batch_size=128, MC_SLICE_SIZE=256KB |
Parameter Tuning Results#
The following table shows how different parameters affect write throughput:
| block_size | threads | batch_size | MC_SLICE_SIZE | Throughput |
|---|---|---|---|---|
| 64KB | 8 | 128 | default (64KB) | 69.47 GB/s |
| 256KB | 8 | 128 | default | 70.09 GB/s |
| 64KB | 16 | 128 | default | 78.80 GB/s |
| 64KB | 32 | 256 | default | 87.65 GB/s |
| 64KB | 64 | 256 | default | 85.72 GB/s |
| 128KB | 32 | 128 | default | 92.33 GB/s |
| 128KB | 32 | 128 | 128KB | 152.26 GB/s |
| 128KB | 32 | 128 | 256KB | 156.18 GB/s |
| 128KB | 48 | 128 | 256KB | 160.34 GB/s |
| 128KB | 64 | 128 | 256KB | 158.82 GB/s |
Key findings:
- MC_SLICE_SIZE is the most impactful tuning parameter: increasing it from the default 64KB to 256KB nearly doubles throughput (92→160 GB/s)
- block_size=128KB outperforms 64KB by ~10-15%
- threads=48 is optimal for 8 EFA devices; 64 threads shows slight diminishing returns
- batch_size=128 is sufficient; increasing to 256+ causes “Cannot select device” errors at higher thread counts
Cross-Transport Comparison#
| Transport | Throughput | Per-NIC Bandwidth | Notes |
|---|---|---|---|
| EFA (tuned) | 168-172 GB/s | ~207-214 Gbps × 8 NICs | MC_SLICE_SIZE=256KB, threads=48 |
| EFA (default) | 69.47 GB/s | ~86 Gbps × 8 NICs | Default parameters |
| TCP (iperf3 baseline) | 9.5 GB/s | 76 Gbps total | Kernel TCP stack, 8 parallel streams |
| TCP (Mooncake) | 0.11 GB/s | — | Mooncake TCP transport, unoptimized for throughput |
EFA (tuned) vs TCP: EFA delivers 17.7x the raw TCP bandwidth by bypassing the kernel network stack.
EFA vs RoCE RDMA: On comparable 8×400 Gbps RoCE networks, Mooncake’s RDMA transport achieves ~190 GB/s. Tuned EFA reaches ~88% of RoCE performance, demonstrating that proper parameter tuning can largely close the gap between SRD-based EFA and hardware-offloaded RDMA.
Tuning Tips#
- Set MC_SLICE_SIZE=262144 (256KB): this is the single most important tuning knob, nearly doubling throughput over the defaults
- Increase --threads to 32-48 to saturate multiple EFA devices (6 threads per device is a good starting point)
- Use --block_size=131072 (128KB) for optimal per-request efficiency
- Keep --batch_size=128; higher values may cause device selection failures with many threads
- Allocate buffers on both NUMA nodes for balanced NIC utilization (the bench tool does this by default)
- Avoid --block_size=256KB or larger with many threads: this can trigger “Cannot select device” errors due to buffer boundary alignment across 8 EFA devices
Technical Details#
Why libfabric instead of ibverbs?#
AWS EFA exposes RDMA-like devices through the ibverbs interface, but does not support the full ibverbs API. Specifically:
- Queue Pair (QP) creation fails with “Operation not supported” (error 95)
- EFA requires libfabric's FI_EP_RDM (Reliable Datagram Message) endpoint type
EFA Transport Architecture#
┌─────────────────────────────────────────────────────┐
│ EfaTransport │
├─────────────────────────────────────────────────────┤
│ EfaContext (per device) │
│ ├── fi_info (fabric info) │
│ ├── fid_fabric (fabric handle) │
│ ├── fid_domain (protection domain) │
│ ├── fid_av (address vector for peer lookup) │
│ ├── fid_cq (completion queues) │
│ └── fid_mr (memory regions) │
├─────────────────────────────────────────────────────┤
│ EfaEndpoint (per connection) │
│ ├── fid_ep (RDM endpoint) │
│ ├── fi_addr_t (peer address) │
│ └── local_addr (local endpoint address) │
└─────────────────────────────────────────────────────┘
Thread Safety#
The EFA transport requests FI_THREAD_SAFE from the libfabric provider and adds per-endpoint spinlocks to serialize fi_write calls. This is necessary because:
- Multiple submission threads may route slices to the same endpoint concurrently
- libfabric RDM endpoints default to FI_THREAD_UNSPEC (no thread-safety guarantees)
- Concurrent fi_write calls without serialization corrupt provider internals, causing completions to silently vanish
Completion queues (CQs) are polled by dedicated worker threads (one per EFA device) that run independently of the submission threads.
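The actual implementation is C++ on top of libfabric; the Python sketch below only illustrates the per-endpoint serialization pattern described above. All names are illustrative, and the lock stands in for the spinlock around fi_write:

```python
import threading

class ToyEndpoint:
    """Stand-in for EfaEndpoint: one lock per endpoint serializes writes."""

    def __init__(self, name: str):
        self.name = name
        self._lock = threading.Lock()  # plays the role of the per-endpoint spinlock

    def write(self, payload: bytes) -> int:
        # Without this lock, two submission threads hitting the same endpoint
        # would correspond to unserialized fi_write calls in the real transport.
        with self._lock:
            return len(payload)  # the real transport posts an RDMA write here

# Several submission threads share a small pool of endpoints, mirroring how
# slices from different threads can be routed to the same endpoint.
endpoints = [ToyEndpoint(f"ep{i}") for i in range(2)]

def submit(thread_id: int) -> None:
    ep = endpoints[thread_id % len(endpoints)]
    for _ in range(1000):
        ep.write(b"x" * 64)

threads = [threading.Thread(target=submit, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```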
EFA vs RoCE RDMA#
| Feature | EFA (libfabric SRD) | RoCE (ibverbs) |
|---|---|---|
| Protocol | Scalable Reliable Datagram | RDMA over Converged Ethernet |
| Endpoint type | FI_EP_RDM | Queue Pairs (true RDMA) |
| Write operation | Software-emulated via messages + ACKs | Hardware-offloaded one-sided RDMA |
| CPU overhead | Moderate (provider processes ACKs) | Minimal (NIC handles everything) |
| Throughput (8×400G) | ~170 GB/s (tuned) | ~190 GB/s |
| AWS availability | All EFA-enabled instances | Not available on AWS |
Supported AWS Instance Types#
- p6-b200.48xlarge (8 EFA devices, rdmap* naming)
- p5e.48xlarge (16 EFA devices, rdmap* naming)
- p4d.24xlarge (4 EFA devices)
- Other EFA-enabled instances
Use fi_info -p efa to list available EFA devices on your instance.
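For scripted environments, the device list can be derived from fi_info output; the sketch below is illustrative and assumes the domain naming convention shown in the verification log above (e.g., rdmap79s0-rdm):

```python
import shutil
import subprocess

fi_info = shutil.which("fi_info") or "/opt/amazon/efa/bin/fi_info"
out = subprocess.run([fi_info, "-p", "efa"],
                     capture_output=True, text=True, check=True).stdout

# Each device shows up as libfabric domains such as "rdmap79s0-rdm"; strip the
# endpoint-type suffix so every physical device is counted once.
devices = sorted({line.split(":", 1)[1].strip().rsplit("-", 1)[0]
                  for line in out.splitlines()
                  if line.strip().startswith("domain:")})
print(f"{len(devices)} EFA device(s): {devices}")
```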
Troubleshooting#
No EFA devices found#
EfaTransport: No EFA devices found
Solution: Verify EFA is available with fi_info -p efa
Permission denied#
fi_fabric failed: Permission denied
Solution: Ensure proper permissions or run with sudo for testing
libfabric not found#
cannot find -lfabric
Solution: Verify /opt/amazon/efa/lib is in the library path:
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
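To confirm the library is loadable from that path at runtime, a quick illustrative check (this only verifies dynamic loading, not EFA functionality):

```python
import ctypes

try:
    ctypes.CDLL("/opt/amazon/efa/lib/libfabric.so")
    print("libfabric loaded successfully")
except OSError as err:
    print("libfabric could not be loaded:", err)
```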
Workers hang under high concurrency#
If transfer_engine_bench hangs with some workers never completing:
- Ensure both nodes are running the same build: the CQ backpressure and thread-safety fixes must be present on both sides
- Reduce concurrency to verify basic connectivity: --threads=1 --batch_size=16
- Check the CQ poller threads: logs should show “Started N CQ polling worker threads”, where N matches the number of EFA devices