AWS EFA Transport for Mooncake

AWS EFA Transport for Mooncake#

This document describes how to build and use Mooncake with AWS Elastic Fabric Adapter (EFA) support using libfabric.

Prerequisites#

1. AWS EFA Driver and libfabric#

EFA driver and libfabric should be pre-installed on AWS instances with EFA support (e.g., p6-b200.48xlarge, p5e.48xlarge, p4d.24xlarge).

Verify installation:

# Check EFA devices
fi_info -p efa

# Verify libfabric location
ls /opt/amazon/efa/lib/libfabric.so
ls /opt/amazon/efa/include/rdma/fabric.h

If not installed, follow AWS EFA documentation.

2. Build Dependencies#

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y \
    build-essential \
    cmake \
    git \
    libgflags-dev \
    libgoogle-glog-dev \
    libjsoncpp-dev \
    libnuma-dev \
    libibverbs-dev \
    libboost-all-dev \
    libcurl4-openssl-dev \
    libyaml-cpp-dev \
    libgtest-dev \
    pybind11-dev \
    python3-dev

# Install yalantinglibs (required)
cd /tmp
git clone https://github.com/alibaba/yalantinglibs.git
cd yalantinglibs
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local
make -j$(nproc)
sudo make install

Building Mooncake with EFA Support#

1. Clone the Repository#

git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive

2. Build with EFA Enabled#

mkdir build && cd build

cmake .. \
    -DUSE_EFA=ON \
    -DUSE_CUDA=ON \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo

make -j$(nproc)

Note: -DUSE_CUDA=ON is required when transferring GPU memory (e.g., KV cache in vLLM). Without it, the TCP transport (used as fallback when mooncake_protocol is set to "tcp") cannot detect GPU memory and will fail with “Bad address” (EFAULT) errors.

3. Install Python Package#

# Copy built modules to wheel directory
cp mooncake-integration/engine.cpython-*.so ../mooncake-wheel/mooncake/
cp mooncake-asio/libasio.so ../mooncake-wheel/mooncake/

# Install with pip
pip install -e ../mooncake-wheel --no-build-isolation

Verification#

Test EFA transport initialization:

from mooncake.engine import TransferEngine

te = TransferEngine()
result = te.initialize('127.0.0.1', 'P2PHANDSHAKE', 'efa', '')
print(f'Initialize result: {result}')  # Should be 0

# You should see logs like:
# EFA device (libfabric): rdmap79s0, domain: rdmap79s0-rdm, provider: efa

Usage with vLLM#

Prefill Instance#

VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
vllm serve <model_path> -tp 8 \
   --port 8010 \
   --trust-remote-code \
   --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_connector_extra_config":{"mooncake_protocol":"efa"}}'

Decode Instance#

vllm serve <model_path> -tp 8 \
   --port 8020 \
   --trust-remote-code \
   --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_connector_extra_config":{"mooncake_protocol":"efa"}}'

Unit Tests#

Run the EFA transport unit tests (requires EFA hardware):

./build/mooncake-transfer-engine/tests/efa_transport_test

The test suite includes:

Test	Description
`InstallTransport`	Verify EFA transport installation
`LoopbackWrite`	Loopback write operation
`WriteAndRead`	Write then read with data integrity check
`MultiWrite`	Batch write (16 requests)
`StressMultipleBatches`	Stress test (20 batches x 8 requests)

You can also run all unit tests via CTest:

cd build && ctest --output-on-failure

Environment variables for test configuration:

export MC_METADATA_SERVER=P2PHANDSHAKE     # default
export MC_LOCAL_SERVER_NAME=127.0.0.1:12345  # default

Performance Benchmark#

Use transfer_engine_bench to measure EFA transport throughput between two nodes.

Target Node (receiver)#

./build/mooncake-transfer-engine/example/transfer_engine_bench \
    --mode=target \
    --protocol=efa \
    --metadata_server=P2PHANDSHAKE

Initiator Node (sender)#

./build/mooncake-transfer-engine/example/transfer_engine_bench \
    --mode=initiator \
    --protocol=efa \
    --metadata_server=P2PHANDSHAKE \
    --segment_id=<target_hostname>:<target_port> \
    --operation=write \
    --duration=10 \
    --threads=8 \
    --block_size=65536 \
    --batch_size=128 \
    --buffer_size=1073741824 \
    --report_unit=GB

Replace <target_hostname>:<target_port> with the target node’s address shown in the target’s startup log (e.g., ip-172-31-29-226:12345).

Key Parameters#

Parameter	Default	Description
`--block_size`	65536	Bytes per transfer request
`--batch_size`	128	Requests per batch
`--threads`	12	Concurrent submission threads
`--buffer_size`	1 GB	Total buffer size
`--duration`	10	Test duration in seconds
`--operation`	read	`read` or `write`
`--report_unit`	GB	`GB\|GiB\|Gb\|MB\|MiB\|Mb`

Benchmark Results#

Tested on two p6-b200.48xlarge instances (8 EFA devices each, 8×400 Gbps) in the same AWS placement group.

Optimized Results#

With tuned parameters (MC_SLICE_SIZE=262144):

Operation	Throughput	Configuration
Write	167.63 GB/s	threads=48, block_size=128KB, batch_size=128, MC_SLICE_SIZE=256KB
Read	171.89 GB/s	threads=48, block_size=128KB, batch_size=128, MC_SLICE_SIZE=256KB

Parameter Tuning Results#

The following table shows how different parameters affect write throughput:

block_size	threads	batch_size	MC_SLICE_SIZE	Throughput
64KB	8	128	default (64KB)	69.47 GB/s
256KB	8	128	default	70.09 GB/s
64KB	16	128	default	78.80 GB/s
64KB	32	256	default	87.65 GB/s
64KB	64	256	default	85.72 GB/s
128KB	32	128	default	92.33 GB/s
128KB	32	128	128KB	152.26 GB/s
128KB	32	128	256KB	156.18 GB/s
128KB	48	128	256KB	160.34 GB/s
128KB	64	128	256KB	158.82 GB/s

Key findings:

MC_SLICE_SIZE is the most impactful tuning parameter — increasing from default 64KB to 256KB nearly doubles throughput (92→160 GB/s)
block_size=128KB outperforms 64KB by ~10-15%
threads=48 is optimal for 8 EFA devices; 64 threads shows slight diminishing returns
batch_size=128 is sufficient; increasing to 256+ causes “Cannot select device” errors at higher thread counts

Cross-Transport Comparison#

Transport	Throughput	Per-NIC Bandwidth	Notes
EFA (tuned)	168-172 GB/s	~207-214 Gbps × 8 NICs	MC_SLICE_SIZE=256KB, threads=48
EFA (default)	69.47 GB/s	~86 Gbps × 8 NICs	Default parameters
TCP (iperf3 baseline)	9.5 GB/s	76 Gbps total	Kernel TCP stack, 8 parallel streams
TCP (Mooncake)	0.11 GB/s	—	Mooncake TCP transport, unoptimized for throughput

EFA (tuned) vs TCP: EFA delivers 17.7x the raw TCP bandwidth by bypassing the kernel network stack.

EFA vs RoCE RDMA: On comparable 8×400 Gbps RoCE networks, Mooncake’s RDMA transport achieves ~190 GB/s. Tuned EFA reaches ~88% of RoCE performance, demonstrating that proper parameter tuning can largely close the gap between SRD-based EFA and hardware-offloaded RDMA.

Tuning Tips#

Set MC_SLICE_SIZE=262144 (256KB) — this is the single most important tuning knob, nearly doubling throughput from defaults
Increase --threads to 32-48 to saturate multiple EFA devices (6 threads per device is a good starting point)
Use --block_size=131072 (128KB) for optimal per-request efficiency
Keep --batch_size=128; higher values may cause device selection failures with many threads
Allocate buffers on both NUMA nodes for balanced NIC utilization (the bench tool does this by default)
Avoid --block_size=256KB or larger with many threads — this can trigger “Cannot select device” errors due to buffer boundary alignment across 8 EFA devices

Technical Details#

Why libfabric instead of ibverbs?#

AWS EFA exposes RDMA-like devices through the ibverbs interface, but does not support the full ibverbs API. Specifically:

Queue Pair (QP) creation fails with “Operation not supported” (error 95)
EFA requires using libfabric’s FI_EP_RDM (Reliable Datagram Message) endpoint type

EFA Transport Architecture#

┌─────────────────────────────────────────────────────┐
│                   EfaTransport                       │
├─────────────────────────────────────────────────────┤
│  EfaContext (per device)                            │
│  ├── fi_info      (fabric info)                     │
│  ├── fid_fabric   (fabric handle)                   │
│  ├── fid_domain   (protection domain)               │
│  ├── fid_av       (address vector for peer lookup)  │
│  ├── fid_cq       (completion queues)               │
│  └── fid_mr       (memory regions)                  │
├─────────────────────────────────────────────────────┤
│  EfaEndpoint (per connection)                       │
│  ├── fid_ep       (RDM endpoint)                    │
│  ├── fi_addr_t    (peer address)                    │
│  └── local_addr   (local endpoint address)          │
└─────────────────────────────────────────────────────┘

Thread Safety#

The EFA transport requests FI_THREAD_SAFE from the libfabric provider and adds per-endpoint spinlocks to serialize fi_write calls. This is necessary because:

Multiple submission threads may route slices to the same endpoint concurrently
libfabric RDM endpoints default to FI_THREAD_UNSPEC (no thread safety guarantees)
Concurrent fi_write without serialization corrupts provider internals, causing completions to silently vanish

CQ completion queues are polled by dedicated worker threads (one per EFA device) that run independently of submission threads.

EFA vs RoCE RDMA#

Feature	EFA (libfabric SRD)	RoCE (ibverbs)
Protocol	Scalable Reliable Datagram	RDMA over Converged Ethernet
Endpoint type	`FI_EP_RDM` (message-based)	Queue Pairs (true RDMA)
Write operation	Software-emulated via messages + ACKs	Hardware-offloaded one-sided RDMA
CPU overhead	Moderate (provider processes ACKs)	Minimal (NIC handles everything)
Throughput (8×400G)	~170 GB/s (tuned)	~190 GB/s
AWS availability	All EFA-enabled instances	Not available on AWS

Supported AWS Instance Types#

p6-b200.48xlarge (8 EFA devices, rdmap* naming)
p5e.48xlarge (16 EFA devices, rdmap* naming)
p4d.24xlarge (4 EFA devices)
Other EFA-enabled instances

Use fi_info -p efa to list available EFA devices on your instance.

Troubleshooting#

No EFA devices found#

EfaTransport: No EFA devices found

Solution: Verify EFA is available with fi_info -p efa

Permission denied#

fi_fabric failed: Permission denied

Solution: Ensure proper permissions or run with sudo for testing

libfabric not found#

cannot find -lfabric

Solution: Verify /opt/amazon/efa/lib is in the library path:

export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH

Workers hang under high concurrency#

If transfer_engine_bench hangs with some workers never completing:

Ensure both nodes are running the same build — the CQ backpressure and thread-safety fixes must be present on both sides
Reduce concurrency to verify basic connectivity: --threads=1 --batch_size=16
Check CQ poller threads: logs should show “Started N CQ polling worker threads” where N matches the number of EFA devices