Mooncake PG/EP Troubleshooting#
This page covers common setup, import, runtime, and recovery issues for Mooncake Backend (PG) and Mooncake EP.
Import fails with a PyTorch version error#
Symptoms:
Mooncake PG was not built against torch==...
Mooncake EP was not built against torch==...
Cause:
mooncake.pg and mooncake.ep load native extension modules whose names include
the current PyTorch version. If the wheel or source build does not contain an
extension for the active torch.__version__, import fails.
Fixes:
Verify the active PyTorch version:
python - <<'PY'
import torch
print(torch.__version__)
print(torch.version.cuda)
PY
Install a Mooncake wheel built for that PyTorch/CUDA combination, or rebuild from source with EP/PG enabled:
cmake .. -DWITH_EP=ON
make -j
Make sure the Python environment used at runtime is the same one used for the build.
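To confirm which interpreter is active and which of the modules named above actually import, the checks can be combined into one script. This is a minimal sketch: `try_import` is a hypothetical helper, not part of Mooncake, and the script only assumes the module names `torch`, `mooncake.pg`, and `mooncake.ep` used on this page.

```python
import importlib
import sys


def try_import(module_name):
    """Return (ok, detail) for an import attempt without raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or a version-mismatch error raised at import time
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__version__", "imported")


if __name__ == "__main__":
    # The interpreter path shows which environment is actually in use.
    print("python:", sys.executable)
    for name in ("torch", "mooncake.pg", "mooncake.ep"):
        ok, detail = try_import(name)
        print(f"{name}: {'OK' if ok else 'FAILED'} ({detail})")
```

Run it with the same `python` that launches the job; if the printed interpreter path differs from the one used for the build, fix the environment first.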
activeRanks must be int or device mismatch#
Symptoms:
activeRanks must be int.
activeRanks must be on CPU.
activeRanks must be on GPU.
activeRanks must be sized to max_world_size when max_world_size is set
Causes and fixes:
Use torch.int32, not bool, int64, or floating-point types.
For backend="mooncake", put active_ranks on the accelerator device.
For backend="mooncake-cpu", put active_ranks on CPU.
If max_world_size is set, allocate active_ranks with length max_world_size, even if the initial visible world_size is smaller.
Examples:
# CUDA / accelerator backend
active_ranks = torch.ones(max_world_size, dtype=torch.int32, device="cuda")
# CPU backend
active_ranks = torch.ones(max_world_size, dtype=torch.int32)
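The rules above can also be checked in application code before the tensor reaches the backend. The following is a sketch only: `check_active_ranks` is a hypothetical helper, its messages are not Mooncake's error strings, and it inspects just the `dtype`, `device.type`, and `numel()` attributes of the tensor, so the helper itself does not need to import torch.

```python
def check_active_ranks(active_ranks, backend, max_world_size=None):
    """Return a list of problems found; an empty list means the tensor looks valid.

    `active_ranks` is expected to be a torch.Tensor, but only its `dtype`,
    `device.type`, and `numel()` are inspected.
    """
    problems = []
    # str(torch.int32) is "torch.int32"; comparing strings avoids a torch import here.
    if str(active_ranks.dtype) != "torch.int32":
        problems.append(f"dtype is {active_ranks.dtype}, expected torch.int32")

    on_cpu = active_ranks.device.type == "cpu"
    if backend == "mooncake" and on_cpu:
        problems.append("backend 'mooncake' needs active_ranks on the accelerator")
    if backend == "mooncake-cpu" and not on_cpu:
        problems.append("backend 'mooncake-cpu' needs active_ranks on CPU")

    if max_world_size is not None and active_ranks.numel() != max_world_size:
        problems.append(
            f"length is {active_ranks.numel()}, expected max_world_size={max_world_size}"
        )
    return problems
```

Logging the returned list on every rank before initialization makes a dtype or device mismatch visible without waiting for the backend to reject the tensor.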
dist.get_world_size() is smaller than max_world_size#
This is expected. Mooncake PG distinguishes reserved capacity from visible active membership:
max_world_size reserves future rank slots.
dist.get_world_size() returns the visible active size.
Reserved ranks are inactive until recover_ranks() activates them.
Use pg.get_active_ranks(backend) to inspect the current backend mask.
get_peer_state() or join_group() hangs#
Common causes:
Healthy ranks are not all calling get_peer_state() in the same order.
The joining rank did not initialize with is_extension=True.
max_world_size / rank numbering differs between healthy and joining ranks.
Subgroups were created in different orders on healthy and joining processes.
The joining process has not published peer metadata yet.
Network device filters differ across ranks.
Debug checklist:
Confirm all ranks use the same rendezvous address and store.
Print rank, visible world_size, and max_world_size at initialization.
Confirm active_ranks length and values on every rank.
Verify healthy ranks poll the same join_ranks list.
For subgroup recovery, confirm every rank calls dist.new_group() in the same order.
If using RDMA, set the same HCA whitelist on every rank.
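Most items in this checklist amount to comparing the same settings across ranks. One way to do that, sketched below, is to have each rank log a small dictionary at initialization and then diff the collected dictionaries offline. `find_mismatches` and the dictionary keys are hypothetical names chosen for illustration; how the per-rank values are gathered (log files, a shared store, and so on) is left to the application.

```python
def find_mismatches(reports):
    """Given {rank: settings_dict}, return {key: {rank: value}} for keys whose values differ."""
    mismatches = {}
    keys = sorted({key for report in reports.values() for key in report})
    for key in keys:
        per_rank = {rank: report.get(key) for rank, report in reports.items()}
        # repr() keeps unhashable values such as lists comparable.
        if len({repr(value) for value in per_rank.values()}) > 1:
            mismatches[key] = per_rank
    return mismatches


# Illustrative values only: rank 2 was started with a different HCA filter.
reports = {
    0: {"max_world_size": 4, "device_filter": ["mlx5_1"], "join_ranks": [3]},
    1: {"max_world_size": 4, "device_filter": ["mlx5_1"], "join_ranks": [3]},
    2: {"max_world_size": 4, "device_filter": ["mlx5_0"], "join_ranks": [3]},
}
for key, per_rank in find_mismatches(reports).items():
    print(f"{key} differs across ranks: {per_rank}")
```

Any key that shows up in the result is a candidate explanation for the hang and is worth fixing before looking at the transport.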
Newly extended ranks participate too early#
After pg.extend_group_size_to(backend, new_size), new ranks are reserved but
inactive. They should not participate in collectives until healthy ranks call
pg.recover_ranks(backend, ranks).
If a new rank appears to participate early, check whether:
active_ranks was initialized with 1 for future ranks without masking them through the backend protocol;
the application called collectives on the joining process before pg.join_group() returned;
different ranks used inconsistent max_world_size or rank IDs.
EP dispatch/combine timeout marks a rank inactive#
Mooncake EP kernels can update the EP-level active_ranks tensor when a source
rank does not make progress before timeout_us.
If this happens unexpectedly:
Increase timeout_us to rule out slow startup or transient scheduling delay.
Check whether the source rank exited, crashed, or skipped the matching dispatch/combine call.
Confirm num_experts % num_ranks == 0 and that every rank uses the same num_experts, top_k, and buffer sizing assumptions.
Ensure all ranks call dispatch and combine in the same order.
If timeout detection is not desired for a test, pass timeout_us=-1.
EP falls back instead of using the fast path#
The Python wrapper falls back to a slower path when the native runtime cannot use IBGDA/RDMA or fully accessible P2P.
Possible causes:
RDMA devices or drivers are not available inside the container.
HCA selection picked the wrong device.
GPUDirect RDMA / peer-memory support is missing.
CUDA IPC or peer access is unavailable between local GPUs.
The environment was built without the required accelerator support.
Debug checklist:
Restrict HCA selection with pg.set_device_filter([...]) or MOONCAKE_PGTEST_DEVICE_FILTERS.
Check whether all intended ranks can see the same accelerator and RDMA devices.
Run PG collectives first; EP metadata exchange depends on a healthy process group.
Run EP tests with fallback enabled and disabled to separate correctness from transport setup.
RDMA connection errors or severe latency from wrong HCA selection#
Symptoms can include connection setup failures, repeated transport timeouts, unexpected fallback, very low bandwidth, or RDMA retry-style errors when the backend auto-selects an unsuitable NIC/HCA. This commonly happens on machines with multiple RDMA devices where some devices are for management traffic, are on different fabrics, or are not reachable from peer ranks.
Fix: set an explicit device filter before initializing Mooncake PG/EP so all ranks use the intended HCA list.
For application code:
from mooncake import pg
pg.set_device_filter(["mlx5_1", "mlx5_2"])
# call dist.init_process_group(...) after setting the filter
For PG tests and benchmarks:
MOONCAKE_PGTEST_DEVICE_FILTERS=mlx5_1,mlx5_2 \
python -m unittest discover -s mooncake-pg/tests -k CUDA -v
Use the same filter on every rank. If the problem disappears after setting the filter, treat the original failure as a topology/device-selection issue rather than an EP kernel or collective correctness problem.
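Before choosing a filter, it helps to see which RDMA devices each host actually exposes. On Linux the kernel lists them under `/sys/class/infiniband`, and the entries there use names of the same form as the `mlx5_*` names in the example above. The sketch below reads that directory; `list_rdma_devices` is a hypothetical helper, and treating these names as valid filter values is an assumption based on that example rather than on a documented Mooncake guarantee.

```python
import os


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the sorted RDMA device names found under sysfs, or [] if none exist."""
    try:
        return sorted(os.listdir(sysfs_root))
    except OSError:
        # Missing directory: no RDMA driver loaded, or sysfs is not visible in the container.
        return []


if __name__ == "__main__":
    print("RDMA devices:", list_rdma_devices() or "none found")
```

Running this on every node, inside the container if one is used, shows whether all ranks see the same device names before a common filter is applied.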
EP output mismatch or buffer overflow symptoms#
Common causes:
num_max_dispatch_tokens_per_rank is smaller than the actual per-rank token count.
num_experts is not divisible by num_ranks.
topk_idx contains expert IDs outside [0, num_experts) except for masked -1 entries.
The handle from one dispatch call is reused with an unrelated combine call.
zero_copy=True is used without writing expert outputs into get_next_combine_buffer(handle).
Fixes:
Size the buffer for peak traffic:
num_ep_buffer_bytes = Buffer.get_ep_buffer_size_hint(
    max_tokens_per_rank, hidden, world_size, num_experts
)
Keep dispatch and combine paired by using the matching handle.
Call event.current_stream_wait() or the returned hook() before consuming operation outputs.
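The first three causes listed in this section can be caught on the host before dispatch is called. The check below is a sketch under stated assumptions: `check_dispatch_inputs` is a hypothetical helper, and it takes the routing table as a plain list of rows (for a tensor, pass `topk_idx.tolist()`), which copies the data to the host and is therefore suited to debugging rather than production steps.

```python
def check_dispatch_inputs(topk_idx, num_experts, num_ranks, num_max_dispatch_tokens_per_rank):
    """Return a list of problems for one rank's routing table.

    `topk_idx` is a list of rows, one per token, each holding expert IDs
    or -1 for masked slots.
    """
    problems = []
    if num_experts % num_ranks != 0:
        problems.append(f"num_experts={num_experts} is not divisible by num_ranks={num_ranks}")

    if len(topk_idx) > num_max_dispatch_tokens_per_rank:
        problems.append(
            f"{len(topk_idx)} tokens exceed "
            f"num_max_dispatch_tokens_per_rank={num_max_dispatch_tokens_per_rank}"
        )

    # -1 marks a masked entry; every other ID must lie in [0, num_experts).
    bad_ids = sorted(
        {e for row in topk_idx for e in row if e != -1 and not 0 <= e < num_experts}
    )
    if bad_ids:
        problems.append(f"expert IDs out of range: {bad_ids}")
    return problems
```

An empty result does not prove the buffer is large enough for every step; it only confirms that the inputs of the checked step respect the limits passed in.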
P2P send/recv errors#
Mooncake PG’s PyTorch P2P path currently supports single-tensor send/recv
through the backend shim. If batch_isend_irecv() fails:
verify each P2POp contains one tensor;
ensure peer ranks are active;
check that all ranks issue matching send/recv operations;
test CPU and accelerator backends separately to isolate device-specific transport issues.
Useful smoke commands#
# PG CPU sanity
python -m unittest discover -s mooncake-pg/tests -k CPU -v
# PG CUDA sanity
python -m unittest discover -s mooncake-pg/tests -k CUDA -v
# PG all-reduce benchmark smoke
PYTHONPATH=mooncake-pg \
python mooncake-pg/benchmark/pgbench.py \
--collective all_reduce --backend mooncake --device cuda -g 2 -b 8 -e 1M -f 2
# EP grid test
python mooncake-ep/tests/test_ep_grid.py
Adapt process counts, device filters, and launchers to the target cluster.
What to include in bug reports#
When reporting PG/EP issues, include:
Mooncake commit and installation method;
PyTorch version and CUDA/accelerator runtime version;
exact backend name (mooncake or mooncake-cpu);
world size, max world size, rank IDs, and subgroup layout;
active-rank tensor dtype/device/values;
HCA/device filters;
minimal command or script reproducing the issue;
logs from all ranks around initialization, failure detection, and recovery.
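Part of this list can be collected automatically. The script below is a sketch that gathers only details available without a running process group; `collect_report` is a hypothetical helper, and the Mooncake commit, backend name, rank layout, and logs still have to be added by hand.

```python
import json
import os
import platform
import sys


def collect_report():
    """Gather environment details that do not require a running process group."""
    report = {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "platform": platform.platform(),
        "MOONCAKE_PGTEST_DEVICE_FILTERS": os.environ.get("MOONCAKE_PGTEST_DEVICE_FILTERS"),
    }
    try:
        import torch

        report["torch"] = torch.__version__
        report["torch_cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        report["torch"] = None
    return report


if __name__ == "__main__":
    print(json.dumps(collect_report(), indent=2))
```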