Troubleshooting#
This document lists common errors that may occur when using Mooncake Store and provides troubleshooting and resolution measures.
CheckList
[ ] connectable_name is not the local machine's LAN/WAN address, e.g., it is set to the loopback address (127.0.0.1/localhost) or to the address of another machine.
[ ] Incorrect MTU or GID configuration. Use the environment variables MC_MTU and MC_GID_INDEX to override them.
[ ] Incorrect RDMA device name, or the device connection status is not active.
[ ] etcd is not started properly, or it is not bound to 0.0.0.0.
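The following commands can be used to quickly verify these items. This is a minimal sketch assuming the default etcd client port 2379; adjust device names and ports to your environment.

# List RDMA devices and confirm the port state is PORT_ACTIVE
ibv_devinfo | grep -E 'hca_id|state'
# Show any Transfer Engine overrides currently exported in this shell
env | grep -E 'MC_MTU|MC_GID_INDEX|MC_IB_PORT'
# Confirm etcd is listening on 0.0.0.0 (or a LAN address), not only on 127.0.0.1
ss -ltnp | grep 2379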
Metadata and Out-of-Band Communication#
At startup, a TransferMetadata object is constructed according to the incoming metadata_server parameter. During program execution, this object is used to communicate with the etcd server to maintain the internal data required for connection establishment.
At startup, the current node is registered with the cluster according to the incoming connectable_name and rpc_port parameters, and the TCP port specified by rpc_port is listened on. Before other nodes send their first read/write request to the current node, they use this information, resolve DNS, and initiate a connection through the socket connect() method.
Errors in this part usually indicate that the error occurred within the mooncake-transfer-engine/src/transfer_metadata.cpp file.
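If registration appears to have failed, one way to confirm what the node actually wrote to etcd is to list the keys in the metadata store. This is a rough sketch assuming the etcd v3 API and the etcdctl client are available; 10.0.0.1:2379 is an assumed metadata_server address. The exact key layout used by Transfer Engine is not spelled out here, so simply list everything and look for entries mentioning the node's name.

# List all keys currently stored in etcd
ETCDCTL_API=3 etcdctl --endpoints=http://10.0.0.1:2379 get "" --prefix --keys-only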
Recommended Troubleshooting Directions#
The incoming metadata_server parameter is not a valid and reachable etcd server address (or group of addresses). In this case, an Error from etcd client error is displayed. It is recommended to investigate the following aspects:
After installing the etcd service, the default listening IP is 127.0.0.1, which other nodes cannot use. Therefore, the actual listening IP should be determined in conjunction with the network environment. In an experimental environment, 0.0.0.0 can be used. For example, the following command line can be used to start the required service:
# Run this on 10.0.0.1 (the etcd host)
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.0.0.1:2379
You can verify this on other nodes using curl <metadata_server>.
HTTP proxies need to be disabled before starting the program:
unset http_proxy
unset https_proxy
Other nodes cannot establish a socket-based out-of-band connection with the current node through the incoming connectable_name and rpc_port parameters to perform connection establishment. The errors in this case may include:
A bind address already in use error is displayed when starting the process, usually because the port number specified by rpc_port is already occupied. Try another port number.
Another node displays a connection refused type of error when initiating a Batch Transfer to the current node. Focus on checking the correctness of these two parameters when the current node was created: connectable_name must be a non-loopback IP address of the current node or a valid hostname (with records in DNS or /etc/hosts), and other nodes in the cluster must be able to connect to the current node using the connectable_name and rpc_port parameters (see the connectivity sketch after this list).
Some networks have firewall mechanisms; the relevant ports need to be added to the whitelist in advance.
If the correspondence between the local_server_name and connectable_name/rpc_port parameters changes and various errors occur, you can try clearing the etcd database and restarting the cluster.
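A minimal sketch of these checks, using rpc_port 12345, node name node1, and etcd endpoint 10.0.0.1:2379 as placeholder values:

# On the current node: confirm what is listening on rpc_port (also shows the occupant if bind failed)
ss -ltnp | grep 12345
# On another node: confirm connectable_name resolves and the port is reachable (firewall permitting)
getent hosts node1
nc -vz node1 12345
# If stale metadata is suspected, clear the etcd database and restart the cluster.
# Note: this removes every key in the etcd instance, so only do it if etcd is dedicated to Transfer Engine metadata.
ETCDCTL_API=3 etcdctl --endpoints=http://10.0.0.1:2379 del "" --prefix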
RDMA Resource Initialization#
At startup, all RDMA NICs listed in the incoming nic_priority_matrix parameter are initialized, including device contexts and other internal objects.
Before other nodes send their first read/write request to the current node, they exchange GID, LID, QP Num, etc., through the out-of-band communication mechanism and complete the establishment of RDMA reliable connections.
Errors in this part usually indicate that the error occurred within the mooncake-transfer-engine/src/transport/rdma_transport/rdma_*.cpp files.
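Before digging into specific errors, it can help to cross-check every device name that appears in nic_priority_matrix against what the machine actually exposes; mlx5_0 below is an assumed example name.

# List the RDMA devices visible on this machine
ibv_devices
# For each device named in nic_priority_matrix, confirm it exists and its port is ACTIVE
ibv_devinfo -d mlx5_0 | grep -E 'state|active_mtu'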
Recommended Troubleshooting Directions#
If the error No matched device found is displayed, check whether any NIC names in the nic_priority_matrix parameter do not exist on the machine. You can use the ibv_devinfo command to view the list of installed NICs.
If the error Device XXX port not active is displayed, the default RDMA port of the corresponding device (the RDMA device port, not to be confused with the rpc_port TCP port) is not in the ACTIVE state. This is usually because an RDMA cable is not installed properly or the driver is not configured correctly. You can use the MC_IB_PORT environment variable to change the default RDMA port used.
If the errors Worker: Cannot make connection for endpoint and Failed to exchange handshake description are both displayed, the two parties cannot establish a reliable RDMA connection. In most cases this is due to incorrect configuration on one side or the two parties being unable to reach each other. First, use tools such as ib_send_bw to confirm the reachability of the two nodes, paying attention to the GID, LID, MTU, and other parameters in the output. Then analyze the possible cause based on the error message:
After starting, the log output usually includes several lines like RDMA device: XXX, LID: XXX, GID: (X) XX:XX:XX:.... If the displayed GID address is all zeros (the number in brackets is the GID index), you need to choose the correct GID index for your network environment and specify it at startup using the MC_GID_INDEX environment variable (see the sketch after this list).
If the error Failed to modify QP to RTR, check mtu, gid, peer lid, peer qp num is displayed, first determine which party the error occurred on. If there is no prefix Handshake request rejected by peer endpoint:, the problem comes from the party displaying the error. Based on the error message, check the MTU length configuration (adjust it with the MC_MTU environment variable), whether your own and the peer's GID addresses are valid, and so on. Note that if the two nodes have no physical connectivity, the process may also hang or be interrupted at this step.
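A minimal sketch of picking a usable GID index, assuming device mlx5_0, port 1, and index 3 as illustrative values; show_gids ships with MLNX_OFED and may not be present on every system.

# List GID table entries per device/port/index (prefer a non-zero entry bound to your NIC's network)
show_gids
# Or inspect sysfs directly
cat /sys/class/infiniband/mlx5_0/ports/1/gids/3
# Export the chosen index (and, if needed, an MTU override) before starting the program
export MC_GID_INDEX=3
export MC_MTU=4096   # assumed value; typical RDMA path MTUs are 512/1024/2048/4096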
If you encounter an error like Failed to register memory 0x7efb94000000: Input/output error [5], RDMA memory registration may have failed. This is usually caused by a device limitation: some machines can only register up to 64 GB of memory, and registration beyond this limit fails.
Diagnostic commands:
Use ulimit -a to check system limits, particularly the max locked memory value.
Use ibv_devinfo -v to check RDMA device capabilities, especially the max_mr_size field.
Use dmesg -T to view detailed failure logs, which may show messages like:
[Wed Jul 30 11:49:18 2025] infiniband erdma_0: ERROR: Out of mr size: 0, max 68719476736
Solution: ensure that the total registered memory does not exceed the device's upper limit. You may need to reduce the amount of memory being registered or split large memory regions into smaller chunks that fit within the device's max_mr_size limit.
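The checks above can be condensed into a few commands; erdma_0 is an assumed device name, and raising the locked-memory limit only helps when the failure comes from ulimit rather than the device itself.

# Report the device's maximum size for a single memory registration
ibv_devinfo -v -d erdma_0 | grep max_mr_size
# Check and, if necessary, raise the locked-memory limit for the current shell
ulimit -l
ulimit -l unlimited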
RDMA Transfer Period#
Recommended Troubleshooting Directions#
If the network is unstable, some requests may fail to be delivered, and errors like Worker: Process failed for slice are displayed. Transfer Engine can usually work around such problems, for example by reselecting paths. In more complex cases, if a large number of such errors are output continuously, it is recommended to look for the cause based on the error string in the last field.
Note: In most cases, all errors after the first occurrence are work request flushed errors. This is because when the first error occurs, the RDMA driver puts the connection into an unusable state, so the tasks already in the submission queue cannot execute and are reported as subsequent errors. It is therefore recommended to locate and examine the first occurrence of the error.
In addition, if the error Failed to get description of XXX is displayed, the segment name passed by the user when calling the openSegment interface cannot be found in the etcd database. For memory read/write scenarios, the segment name must exactly match the local_hostname field filled in by the other node during initialization.
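A small sketch for isolating the first real failure in a run, assuming the engine's log was captured to a file named transfer.log (the actual log destination depends on how you run the program):

# Skip the cascaded "work request flushed" lines and show the earliest genuine failures
grep -v "work request flushed" transfer.log | grep -in "failed" | head -n 5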
SGLang Common Questions#
Do I need RDMA to run SGLang and Mooncake?#
When using Mooncake for KV cache transfer in SGLang PD disaggregation deployments, GPUDirect RDMA (GDR) is required.
When using Mooncake as a KV cache storage backend in SGLang HiCache, RDMA is recommended for better performance. However, if RDMA NICs are not available, the TCP protocol is also supported.
How to make sure GPUDirect RDMA (GDR) is supported#
Verify the presence of an RDMA-capable NIC (e.g., Mellanox, ERDMA) and drivers.
ibv_devices
lspci | grep rdma
lsmod | grep -E 'ib_core|mlx4_core|mlx5_core|nvidia_peer_mem'
If no RDMA devices appear: (1) Confirm physical NIC presence via lspci (2) Install vendor-specific drivers (e.g., Mellanox MLNX_OFED)
Check that the GDR driver is ready and that the peer_memory module (part of MLNX_OFED) is installed:
# Check peer_memory module (from MLNX_OFED)
lsmod | grep peer_mem
# Verify NVIDIA peer memory module
lsmod | grep nvidia_peer_mem
If you run SGLang in a container, make sure the RDMA and GDR drivers are available inside the container and run the container in privileged mode. Requirements: (1) privileged mode must be enabled; (2) RDMA devices and NVIDIA devices must be mounted into the container.
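For example, a container might be started along the following lines; the image name is a placeholder, and --gpus all requires the NVIDIA Container Toolkit on the host.

# --privileged already exposes host devices; the explicit /dev/infiniband mount is shown for clarity
docker run --privileged --gpus all --network host \
  -v /dev/infiniband:/dev/infiniband \
  <sglang_image>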
Check connectivity and benchmark end-to-end performance using ib_write_bw.
apt install perftest
# server side
ib_write_bw -d [rdma_device] -R -x gdr
# client side
ib_write_bw -d [rdma_device] -R -x gdr [server_ip]
Expected output: a successful transfer with "BW peak" reported. Errors when running with -x gdr indicate GDR setup failures.