Heterogeneous Ascend Transport

Overview

Heterogeneous Ascend Transport is a high-performance data transmission library for heterogeneous inference, where the PREFILL and DECODE stages run on different accelerator types. The roles in this setup are:

- 910B device: executes the PREFILL stage of inference.
- H20 device: executes the DECODE stage of inference (token generation).
- Heterogeneous Ascend Transport: manages cross-device KVCache transmission between 910B NPU memory and H20 GPU memory.

The current version supports only WRITE semantics. READ semantics will be implemented in a future release.

Enhanced HBM-to-DRAM Data Transfer Optimization

The HBM-to-DRAM copy bandwidth depends on the size of the data blocks: blocks smaller than 2MB leave the bandwidth underutilized. To address this, we combine data aggregation with pipeline parallelism. Small data blocks are first aggregated into 8MB blocks within HBM and then copied to DRAM, and the HBM-to-DRAM copy runs in parallel with RDMA transmission. This hides the HBM-to-DRAM copy latency and reduces the overall transmission time.
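The aggregation and pipelining idea can be illustrated with a small host-side sketch. This is not the library's implementation: `aggregate`, `pipelined_transfer`, `copy_fn`, and `send_fn` are hypothetical names, the copy and send stages are plain Python callables standing in for the HBM-to-DRAM copy and the RDMA write, and a bounded queue stands in for a small pool of staging buffers.

```python
import queue
import threading

CHUNK_SIZE = 8 * 1024 * 1024  # 8MB aggregation size, taken from the text above


def aggregate(blocks, chunk_size=CHUNK_SIZE):
    """Pack consecutive blocks into chunks of up to chunk_size bytes.

    A single block larger than chunk_size is passed through as its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for block in blocks:
        if current and current_len + len(block) > chunk_size:
            chunks.append(b"".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += len(block)
    if current:
        chunks.append(b"".join(current))
    return chunks


def pipelined_transfer(blocks, copy_fn, send_fn, chunk_size=CHUNK_SIZE, depth=2):
    """Overlap the copy stage with the send stage (hypothetical stand-ins)."""
    staged = queue.Queue(maxsize=depth)  # bounded, like a fixed set of staging buffers

    def copier():
        for chunk in aggregate(blocks, chunk_size):
            staged.put(copy_fn(chunk))  # stand-in for the HBM-to-DRAM copy
        staged.put(None)                # sentinel: no more chunks

    worker = threading.Thread(target=copier)
    worker.start()
    sent = 0
    while (chunk := staged.get()) is not None:
        send_fn(chunk)                  # stand-in for the RDMA write
        sent += len(chunk)
    worker.join()
    return sent
```

While the main thread sends chunk N, the worker thread is already copying chunk N+1, which is the overlap that hides the copy latency. The sketch omits error handling: if `copy_fn` raised, the consumer loop would wait forever, so a real implementation would need to propagate failures.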

Build Instructions

The USE_ASCEND_HETEROGENEOUS compilation option has been added to mooncake-common/common.cmake to control this feature:

PREFILL Side (910B):
- Enable the USE_ASCEND_HETEROGENEOUS option
- Recompile the Mooncake project

DECODE Side (H20):
- Use the Mooncake RDMA Transport directly
- Enable GPU Direct functionality (allows RDMA to access GPU memory directly)

Parameter Configuration

The following key parameters are required:

| Parameter | Description |
| --- | --- |
| `source` | Source address (910B NPU memory address) |
| `target_offset` | Target address offset (H20 GPU memory address) |
| `opcode` | Operation code (currently only `WRITE` is supported) |
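To make the roles of these parameters concrete, here is a minimal sketch of a request record that carries them and rejects anything other than `WRITE`. The `TransferRequest` and `OpCode` types below are illustrative Python stand-ins, not the library's actual API, and the `length` field is an assumption that does not appear in the table above.

```python
from dataclasses import dataclass
from enum import Enum


class OpCode(Enum):
    READ = "READ"
    WRITE = "WRITE"


@dataclass(frozen=True)
class TransferRequest:
    source: int                    # source address in 910B NPU memory
    target_offset: int             # target address offset in H20 GPU memory
    length: int                    # assumed field: number of bytes to transfer
    opcode: OpCode = OpCode.WRITE  # only WRITE is currently supported

    def __post_init__(self):
        if self.opcode is not OpCode.WRITE:
            raise NotImplementedError("only WRITE semantics are currently supported")
        if self.source < 0 or self.target_offset < 0 or self.length <= 0:
            raise ValueError("addresses must be non-negative and length positive")
```

For example, `TransferRequest(source=0x1000, target_offset=0x2000, length=65536)` is accepted, while passing `opcode=OpCode.READ` raises `NotImplementedError`.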

Testing Guide

Test Cases

Available test programs:

Initiator (910B): transfer_engine_heterogeneous_ascend_perf_initiator.cpp
Target (H20): rdma_transport_test.cpp (reusing existing test)

After compilation, test programs are located at:

build/mooncake-transfer-engine/example/
build/mooncake-transfer-engine/test/

Test programs are configured through command-line flags declared with macros such as DEFINE_string. Refer to the flag list at the beginning of each test file.

Test Procedure

When metadata_server=P2PHANDSHAKE is configured, the system automatically selects an available port to avoid conflicts, so the target's actual listening address must be read from its log before the initiator is started. Testing steps:

1. Start the target node.
   - Check the target node's log (the message is printed from transfer_engine.cpp).
   - Record the listening information, which has this format:
     ```
     Transfer Engine RPC using <protocol> listening on <IP>:<port>
     ```
2. Configure the initiator node.
   - Set the --segment_id parameter to the actual listening address recorded in step 1 (format: <IP>:<port>).
3. Start the initiator node to complete the connection test.
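Recording the listening address by hand is error-prone, so it can help to extract it from the target's log with a small script. The following is a sketch under the assumption that the log line matches the format shown above; `segment_id_from_log` is a hypothetical helper, and the pattern also tolerates an optional comma before "listening" in case the actual log punctuation differs slightly.

```python
import re

# Pattern for: Transfer Engine RPC using <protocol> listening on <IP>:<port>
LISTEN_RE = re.compile(
    r"Transfer Engine RPC using (?P<protocol>.+?),? listening on "
    r"(?P<ip>\S+):(?P<port>\d+)"
)


def segment_id_from_log(log_text):
    """Return '<IP>:<port>' from the first matching log line, or None."""
    match = LISTEN_RE.search(log_text)
    if match is None:
        return None
    return f"{match['ip']}:{match['port']}"
```

The returned string can be passed directly as the value of --segment_id on the initiator.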

Example Commands

Target (H20):

./rdma_transport_test --mode=target --local_server_name=10.10.10.10 --metadata_server=P2PHANDSHAKE --operation=write --protocol=rdma --device_name=mlx5_1 --use_vram=true --gpu_id=0

Initiator (910B):

./transfer_engine_heterogeneous_ascend_perf_initiator --mode=initiator --local_server_name=10.10.10.10 --metadata_server=P2PHANDSHAKE --operation=write --npu_id=1 --segment_id=10.10.10.10:12345 --device_name=mlx5_1 --block_size=65536 --batch_size=128
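When the initiator is launched from a script, the command line can be assembled from the recorded segment ID. The sketch below uses only the flags that appear in the example command above; `initiator_command` and `bytes_per_batch` are hypothetical helpers, and the per-batch payload calculation assumes that `--block_size` is given in bytes and that each batch submits `--batch_size` blocks.

```python
import shlex


def initiator_command(segment_id, local_server_name, device_name,
                      npu_id=1, block_size=65536, batch_size=128):
    """Build the initiator argv using the flags from the example command."""
    return [
        "./transfer_engine_heterogeneous_ascend_perf_initiator",
        "--mode=initiator",
        f"--local_server_name={local_server_name}",
        "--metadata_server=P2PHANDSHAKE",
        "--operation=write",
        f"--npu_id={npu_id}",
        f"--segment_id={segment_id}",
        f"--device_name={device_name}",
        f"--block_size={block_size}",
        f"--batch_size={batch_size}",
    ]


def bytes_per_batch(block_size, batch_size):
    """Payload per batch, assuming block_size is in bytes (an assumption)."""
    return block_size * batch_size


cmd = initiator_command("10.10.10.10:12345", "10.10.10.10", "mlx5_1")
command_line = shlex.join(cmd)  # shell-safe string form of the argv list
```

Under those assumptions, the example values give 65536 × 128 = 8,388,608 bytes (8 MiB) per batch. The argv list can be passed to `subprocess.run(cmd)` without going through a shell.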