# Ascend Direct Transport
Ascend Direct Transport (referred to as Transport throughout this document) is located at `Mooncake/mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport`.
## Overview
Transport is a transmission adapter layer built on CANN’s ADXL capabilities, providing direct compatibility with the Mooncake Transfer Engine. With Transport, users can perform Host-to-Device, Device-to-Host, and Device-to-Device transfers in Ascend environments using various communication protocols such as HCCS and RDMA. Future enhancements will enable data transfer between heterogeneous Ascend and GPU environments.
To build and use the Transport library, please follow the process described in build.md and enable the `USE_ASCEND_DIRECT` option.
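As a rough sketch, and assuming a standard out-of-tree CMake build (build.md remains the authoritative reference for prerequisites and exact steps), enabling the option looks like this:

```bash
git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
mkdir build && cd build
cmake .. -DUSE_ASCEND_DIRECT=ON   # enable the Ascend Direct Transport backend
make -j
```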
## Additional Dependencies
### Ascend Compute Architecture for Neural Networks (CANN)
Please install the latest version of Ascend Compute Architecture for Neural Networks.
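As a hedged illustration, CANN is typically installed from a `.run` package downloaded from the Ascend website; the exact package name, version, and install path depend on your OS and architecture:

```bash
# Hypothetical package name; substitute the version/arch you downloaded.
chmod +x Ascend-cann-toolkit_<version>_linux-<arch>.run
./Ascend-cann-toolkit_<version>_linux-<arch>.run --install
# Load the CANN environment (default install path shown; adjust if installed elsewhere).
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```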
## Test Cases
Transport provides a performance test file at `mooncake-transfer-engine/example/transfer_engine_ascend_direct_perf.cpp`.

When `metadata_server` is configured as `P2PHANDSHAKE`, Mooncake randomly selects listening ports from the new RPC port mapping to avoid port conflicts. Therefore, testing requires the following steps:
1. Start the target node first and observe the logs printed in `mooncake-transfer-engine/src/transfer_engine.cpp`. Look for a line in the following format: `Transfer Engine RPC using <protocol> listening on <IP>:<actual_port>`.
2. Record the actual listening port number of the target node.
3. Modify the startup command for the initiator node: change the `--segment_id` parameter value to the target node's IP plus the actual listening port (format: `<IP>:<port>`).
4. Start the initiator node to complete the connection test.
The complete command format is shown below.

Start the target node:

```bash
./transfer_engine_ascend_direct_perf --metadata_server=P2PHANDSHAKE --local_server_name=127.0.0.1:12345 --operation=write --device_logicid=0 --mode=target --block_size=16384 --batch_size=32 --block_iteration=10
```

Start the initiator node:

```bash
./transfer_engine_ascend_direct_perf --metadata_server=P2PHANDSHAKE --local_server_name=127.0.0.1:12346 --operation=write --device_logicid=1 --mode=initiator --block_size=16384 --batch_size=32 --block_iteration=10 --segment_id=127.0.0.1:real_port
```
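As an illustration, suppose the target node's log reports port `16001` (a hypothetical value); the initiator command then becomes:

```bash
# Target log (example): Transfer Engine RPC using <protocol> listening on 127.0.0.1:16001
./transfer_engine_ascend_direct_perf --metadata_server=P2PHANDSHAKE --local_server_name=127.0.0.1:12346 --operation=write --device_logicid=1 --mode=initiator --block_size=16384 --batch_size=32 --block_iteration=10 --segment_id=127.0.0.1:16001
```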
## Important Notes
- **Device Setup Required**: Before calling `TransferEngine.initialize()`, you must set the device (e.g., `torch.npu.set_device(0)`).
- **TCP Connection**: Transport internally establishes a host-side TCP connection and automatically searches for available ports. An error is reported if no port is available.
- **Memory Alignment for HCCS**: When using the `HCCS` communication protocol, registered memory addresses must be aligned with page tables (device memory: 2 MB alignment).
- **Configuration File**: Ensure `/etc/hccn.conf` exists, especially in containers. Mount `/etc/hccn.conf` into the container or copy the host file to the container's `/etc` path (see the container sketch at the end of this section).
- **Timeout Configuration**:
  - Use the `ASCEND_CONNECT_TIMEOUT` environment variable to control the link establishment timeout (default: 3 seconds).
  - Use the `ASCEND_TRANSFER_TIMEOUT` environment variable to control the data transfer timeout (default: 3 seconds).
- **RDMA Timeout and Retry Configuration**:
  - Use `HCCL_RDMA_TIMEOUT` to configure the RDMA NIC packet retransmission timeout coefficient. The actual packet retransmission timeout is `4.096us * 2 ^ $HCCL_RDMA_TIMEOUT`.
  - Use `HCCL_RDMA_RETRY_CNT` to configure the RDMA NIC retransmission count.
  - It is recommended to set `ASCEND_TRANSFER_TIMEOUT` slightly larger than `retransmission_timeout * HCCL_RDMA_RETRY_CNT` (see the worked sketch at the end of this section).
- **Communication Protocol Selection**: Within A2 servers/A3 supernodes, the default communication protocol is HCCS. You can force RDMA by setting `export HCCL_INTRA_ROCE_ENABLE=1`. For KV cache transfer scenarios, RDMA is recommended so that transfers do not contend with model collective-communication traffic and degrade inference performance.
- **RDMA Configuration**: When using the `RDMA` communication protocol, if the switch and NIC default configurations are inconsistent or traffic planning is required, you may need to modify the RDMA NIC's Traffic Class and Service Level (an example is shown at the end of this section):
  - Set the Traffic Class with the `ASCEND_RDMA_TC` environment variable.
  - Set the Service Level with the `ASCEND_RDMA_SL` environment variable.
- **Buffer Pool Configuration**: Under the default 4 KB page table configuration, only about 20 GB of host memory can be registered for RDMA. In addition, HCCS does not currently support host-memory transfers. In both of these constrained scenarios, transfers can still be performed by staging data through intermediate buffers. Enable this by setting the `ASCEND_BUFFER_POOL` environment variable in the format `BUFFER_NUM:BUFFER_SIZE` (size in MB). The recommended value is `4:8`, which can be tuned to the actual scenario (see the example at the end of this section).
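The sketches below illustrate several of the settings above. They are assumptions about typical usage, not prescriptive configurations. First, one way to make `/etc/hccn.conf` available inside a container, using a bind mount or `docker cp` (`<image>` and `<container_id>` are placeholders):

```bash
# Bind-mount the host NIC configuration into the container at start time...
docker run -v /etc/hccn.conf:/etc/hccn.conf:ro <image>
# ...or copy it into an already running container.
docker cp /etc/hccn.conf <container_id>:/etc/hccn.conf
```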
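Next, a worked timeout sketch. The values are illustrative only, and it is assumed that `ASCEND_CONNECT_TIMEOUT` / `ASCEND_TRANSFER_TIMEOUT` take seconds, as their 3-second defaults suggest:

```bash
export ASCEND_CONNECT_TIMEOUT=3   # link establishment timeout (default: 3 s)
export HCCL_RDMA_TIMEOUT=14       # retransmission timeout = 4.096us * 2^14 ≈ 67 ms
export HCCL_RDMA_RETRY_CNT=7      # ≈ 67 ms * 7 ≈ 0.47 s worst-case retransmission time
export ASCEND_TRANSFER_TIMEOUT=1  # slightly larger than 0.47 s
```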
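Then, forcing RDMA within an A2 server/A3 supernode and setting the RDMA NIC's Traffic Class and Service Level. The TC/SL values are placeholders and must come from your own switch configuration and traffic plan:

```bash
export HCCL_INTRA_ROCE_ENABLE=1   # use RDMA instead of the default HCCS
export ASCEND_RDMA_TC=132         # placeholder Traffic Class
export ASCEND_RDMA_SL=4           # placeholder Service Level
```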
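Finally, enabling the intermediate buffer pool with the recommended `4:8` setting, which (per the `BUFFER_NUM:BUFFER_SIZE` format) is read here as four buffers of 8 MB each:

```bash
export ASCEND_BUFFER_POOL=4:8     # 4 intermediate buffers, 8 MB each
```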