Kunpeng UB Transport for Mooncake#
This document describes how to build and use Mooncake with Kunpeng UB (Unified Bus) transport support using URMA (Unified Remote Memory Access).
Overview#
UB (Unified Bus) is a transport protocol that sits at the same abstraction layer as RDMA, CXL, NVLink, and TCP, so applications can select it like any other transport. The UB protocol currently has two open-source implementations:
URMA (Unified Remote Memory Access): Provides a unified programming abstraction and core semantic layer for upper-layer applications. It offers unified APIs and semantic interfaces for remote shared memory access and operations, leveraging the low-latency, high-bandwidth characteristics of the UB protocol.
URMA open-source repository: https://atomgit.com/openeuler/umdk
OBMM (Ownership Based Memory Management): A kernel memory management system for supernode environments, supporting cross-node physical memory sharing. It provides efficient remote memory access capabilities through a kernel module (obmm.ko) and a user-space library (libobmm.so).
OBMM open-source repository: https://atomgit.com/openeuler/obmm
Prerequisites#
1. Hardware and Operating System#
Hardware Platform: Kunpeng 950 CPU with native UB interconnect architecture
OS Version: openEuler 24.03 (LTS-SP3)
2. URMA Dependencies#
Install UMDK (URMA development package):
# Install via yum
yum install umdk-urma-devel
# Or build from source
git clone https://atomgit.com/openeuler/umdk.git
cd umdk
mkdir build && cd build
cmake ..
make -j$(nproc)
sudo make install
3. Build Dependencies#
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y \
build-essential \
cmake \
git \
libgflags-dev \
libgoogle-glog-dev \
libjsoncpp-dev \
libnuma-dev \
libibverbs-dev \
libboost-dev \
libcurl4-openssl-dev \
libgtest-dev \
libmsgpack-dev \
libxxhash-dev \
libyaml-cpp-dev \
pybind11-dev \
python3-dev
# Install yalantinglibs (required)
cd /tmp
git clone https://github.com/alibaba/yalantinglibs.git
cd yalantinglibs
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local
make -j$(nproc)
sudo make install
Building Mooncake with UB Support#
1. Clone the Repository#
git clone https://github.com/kvcache-ai/Mooncake.git
cd Mooncake
git submodule update --init --recursive
2. Build with UB Enabled#
mkdir build && cd build
cmake .. \
-DUSE_UB=ON \
-DURMA_INCLUDE_DIR=/usr/include \
-DURMA_LIBRARY=/usr/lib64/liburma.so \
-DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j$(nproc)
3. Install Python Package#
# Copy built modules to wheel directory
cp mooncake-integration/engine.cpython-*.so ../mooncake-wheel/mooncake/
cp mooncake-integration/store.cpython-*.so ../mooncake-wheel/mooncake/
cp mooncake-common/libasio.so ../mooncake-wheel/mooncake/
# Install with pip
pip install -e ../mooncake-wheel --no-build-isolation
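After the editable install, a quick import check can confirm that the copied extension modules are visible to Python. The sketch below assumes the module names `mooncake.engine` and `mooncake.store`, inferred from the `.so` files copied above; `module_available` is a hypothetical helper, not part of Mooncake.

```python
import importlib


def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    try:
        importlib.import_module(name)
    except ImportError:
        # Covers both a missing module and a .so that fails to load.
        return False
    return True


if __name__ == "__main__":
    # Module names are assumed from the copied engine/store .so files.
    for mod in ("mooncake.engine", "mooncake.store"):
        status = "ok" if module_available(mod) else "MISSING"
        print(f"{mod}: {status}")
```

If either module is reported as missing, re-check that the `cp` commands were run from the `build` directory.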
Verification#
Check UB Transport Registration#
# Check if UB transport is registered
./mooncake_server --list-transports
# Expected output: rdma, tcp, nvlink, ub
Test UB Transport Initialization#
from mooncake.engine import TransferEngine
te = TransferEngine()
result = te.initialize('127.0.0.1', 'P2PHANDSHAKE', 'ub', '')
print(f'Initialize result: {result}') # Should be 0
# You should see logs like:
# URMA module init success
# found 1 devices.
# device_name : urma0 EID : 01:02:03:04:05:06:07:08:09:0a:0b:0c:0d:0e:0f:10
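In scripts it can be useful to fail fast when initialization does not return 0. The following is a minimal sketch that wraps the same `initialize` call shown above; `init_ub_engine` is a hypothetical helper, and the error handling is an assumption about how you might want to react to a non-zero code.

```python
def init_ub_engine(engine, local_server_name: str = "127.0.0.1"):
    """Initialize `engine` for UB transport and raise if it reports failure.

    `engine` is expected to expose initialize(host, metadata, protocol, device),
    matching the TransferEngine call in the snippet above.
    """
    rc = engine.initialize(local_server_name, "P2PHANDSHAKE", "ub", "")
    if rc != 0:
        raise RuntimeError(f"UB transport initialization failed with code {rc}")
    return engine


if __name__ == "__main__":
    try:
        from mooncake.engine import TransferEngine
    except ImportError:
        print("mooncake is not installed; skipping UB initialization check")
    else:
        init_ub_engine(TransferEngine())
        print("UB transport initialized")
```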
Usage#
Single Node Benchmark Test#
# Terminal 1: Target (receiver)
./transfer_engine_bench \
--mode=target \
--protocol=ub \
--device_name=urma0 \
--local_server_name=127.0.0.1 \
--metadata_server=P2PHANDSHAKE
# Terminal 2: Initiator (sender). Set PORT to the port the target is listening on.
./transfer_engine_bench \
--mode=initiator \
--protocol=ub \
--device_name=urma0 \
--metadata_server=P2PHANDSHAKE \
--segment_size=8388608 \
--batch_size=1 \
--segment_id=127.0.0.1:$PORT
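When the two commands are launched from a script, it may help to build them from one place so both sides stay consistent. The sketch below only assembles argument lists from the flags shown above. `bench_args` is a hypothetical helper, and the port in the usage example is a placeholder.

```python
import shlex


def bench_args(mode, device="urma0", segment_id=None, binary="./transfer_engine_bench"):
    """Build the transfer_engine_bench argument list for one side of the test."""
    if mode not in ("target", "initiator"):
        raise ValueError(f"unknown mode: {mode}")
    args = [
        binary,
        f"--mode={mode}",
        "--protocol=ub",
        f"--device_name={device}",
        "--metadata_server=P2PHANDSHAKE",
    ]
    if mode == "target":
        args.append("--local_server_name=127.0.0.1")
    else:
        if segment_id is None:
            raise ValueError("initiator needs segment_id (target host:port)")
        # 8388608 bytes = 8 MiB, as in the command above.
        args += ["--segment_size=8388608", "--batch_size=1", f"--segment_id={segment_id}"]
    return args


if __name__ == "__main__":
    print(shlex.join(bench_args("target")))
    # 12345 is a placeholder port, not a value taken from Mooncake.
    print(shlex.join(bench_args("initiator", segment_id="127.0.0.1:12345")))
```

The returned lists can be passed directly to `subprocess.run` without going through a shell.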
Multi-device Benchmark Test#
# Use multiple URMA devices, passed as a comma-separated list
./transfer_engine_bench \
--protocol=ub \
--device_name=urma0,urma1,urma2,urma3
Unit Tests#
Run the UB transport unit tests:
./build/mooncake-transfer-engine/tests/ub_transport_test
The test suite includes:
| Test | Description |
|---|---|
| | Multiple write operations |
| | Multiple read operations with data integrity check |
You can also run all unit tests via CTest:
cd build && ctest --output-on-failure
Environment variables for test configuration:
export MC_METADATA_SERVER=P2PHANDSHAKE # default
export MC_LOCAL_SERVER_NAME=127.0.0.1:12345 # default
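For helper scripts that need the same settings, the two variables can be read with the defaults listed above. This is a sketch of one way to do that in Python. `load_test_config` is a hypothetical helper, and splitting the value into host and port is an assumption, not a description of how the C++ tests parse it.

```python
import os


def load_test_config(env=None):
    """Read the test settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    metadata_server = env.get("MC_METADATA_SERVER", "P2PHANDSHAKE")
    local_name = env.get("MC_LOCAL_SERVER_NAME", "127.0.0.1:12345")
    # Split on the last ':' so the host part is kept whole.
    host, sep, port = local_name.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"expected host:port, got {local_name!r}")
    return {"metadata_server": metadata_server, "host": host, "port": int(port)}


if __name__ == "__main__":
    print(load_test_config())
```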
Technical Details#
UB Transport Architecture#
┌─────────────────────────────────────────────────────┐
│                     UbTransport                     │
├─────────────────────────────────────────────────────┤
│  UrmaContext (per device)                           │
│  ├── urma_device (URMA device handle)               │
│  ├── urma_context (URMA context)                    │
│  ├── urma_jfce (URMA Jetty for Completion Event)    │
│  ├── urma_jfc (URMA Jetty for Completion)           │
│  └── urma_jfr (URMA Jetty for Receive)              │
├─────────────────────────────────────────────────────┤
│  UrmaEndpoint (per connection)                      │
│  ├── urma_jetty (URMA jetty for communication)      │
│  ├── local_jetty (local jetty ID)                   │
│  └── remote_jetty (remote jetty ID)                 │
└─────────────────────────────────────────────────────┘
Key Components#
UbTransport: The main transport class that manages URMA resources and endpoints
UrmaContext: Represents a URMA device context, handling device initialization and resource management
UrmaEndpoint: Represents a connection to a remote peer, handling data transfer operations
mock_urma_api.cpp: Mock implementation of URMA API for testing without real URMA hardware
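The ownership relationships among these components can be illustrated with a small Python model. This is purely illustrative and is not Mooncake's C++ implementation: the `*Model` class names, the method names, and the choice to key endpoints by a (device, peer) pair are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class UrmaContextModel:
    """Per-device state (stands in for UrmaContext)."""
    device_name: str


@dataclass
class UrmaEndpointModel:
    """Per-connection state (stands in for UrmaEndpoint)."""
    device_name: str
    peer: str


@dataclass
class UbTransportModel:
    """Owns one context per device and one endpoint per connection."""
    contexts: dict = field(default_factory=dict)
    endpoints: dict = field(default_factory=dict)

    def add_device(self, name):
        # Reuse the existing context if the device was already added.
        return self.contexts.setdefault(name, UrmaContextModel(name))

    def endpoint(self, device_name, peer):
        if device_name not in self.contexts:
            raise KeyError(f"unknown device: {device_name}")
        key = (device_name, peer)  # assumed connection identity
        if key not in self.endpoints:
            self.endpoints[key] = UrmaEndpointModel(device_name, peer)
        return self.endpoints[key]
```

The model shows contexts being created once per device, and endpoints being created lazily and reused for repeated transfers to the same peer.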
Protocol Advantages#
Optimized for Kunpeng: URMA is designed for the native UB interconnect on Kunpeng chips
RDMA-like Semantics: Provides similar memory semantics to RDMA
High Performance: Leverages UB’s low-latency, high-bandwidth characteristics
Unified Abstraction: Offers a unified programming model for remote memory access
Troubleshooting#
No URMA devices found#
UbTransport: No URMA devices found
Solution: Verify URMA is properly installed and devices are available:
# Check URMA installation
ls /usr/lib64/liburma.so
ls /usr/include/ub/umdk/urma/urma_api.h
# Check for URMA devices
urma_admin -l
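The two `ls` checks can also be scripted, for example as a pre-flight step before building. The sketch below uses the same paths as the commands above. `missing_files` is a hypothetical helper, and the paths may differ if UMDK was installed under another prefix.

```python
from pathlib import Path

# Paths taken from the checks above; adjust if UMDK was installed elsewhere.
DEFAULT_URMA_FILES = (
    "/usr/lib64/liburma.so",
    "/usr/include/ub/umdk/urma/urma_api.h",
)


def missing_files(paths=DEFAULT_URMA_FILES):
    """Return the subset of `paths` that does not exist on this machine."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing URMA files:")
        for p in missing:
            print(f"  {p}")
    else:
        print("URMA library and header found")
```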
URMA initialization failed#
URMA module init failed
Solution: Ensure the URMA kernel module is loaded and the device is properly configured:
# Load URMA module
sudo modprobe urma
# Check module status
lsmod | grep urma
# Check device status
urma_admin -l
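The module check can be automated by reading `/proc/modules`, which is the data `lsmod` formats. The following sketch assumes the module is named `urma`, as in the `modprobe` command above; `module_loaded` is a hypothetical helper. Unlike `grep`, it matches the module name exactly and ignores substring matches.

```python
def module_loaded(name: str, proc_modules_text: str) -> bool:
    """Return True if a kernel module called `name` appears in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == name:
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            text = f.read()
    except OSError:
        print("/proc/modules is not readable on this system")
    else:
        print("urma loaded:", module_loaded("urma", text))
```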
Device port inactive#
Device urma0 port not active
Solution: Ensure the UB port is properly configured and active:
# Check port status
urma_admin -p urma0
Missing liburma.so#
cannot find -lurma
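Solution: This error comes from the linker at build time, so setting LD_LIBRARY_PATH alone does not fix it, because that variable only affects the runtime loader. Verify that the URMA library is installed, then point the build at it, either with -DURMA_LIBRARY as in the build step or through the search paths:
export LIBRARY_PATH=/usr/lib64:$LIBRARY_PATH # link-time search path (GCC)
export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH # runtime search path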
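Once the build succeeds, you can confirm that the runtime loader also finds the library under the current `LD_LIBRARY_PATH`. The sketch below uses `ctypes` for that. `can_load` is a hypothetical helper, and the check covers only runtime loading, not the link step that produces the `cannot find -lurma` message.

```python
import ctypes


def can_load(library: str) -> bool:
    """Return True if the dynamic loader can open `library`."""
    try:
        ctypes.CDLL(library)
    except OSError:
        # Raised when the library is not found or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    print("liburma.so loadable:", can_load("liburma.so"))
```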
Conclusion#
Kunpeng UB Transport gives Mooncake a transport tailored to Kunpeng 950 CPU platforms. By building on the UB protocol’s low-latency, high-bandwidth characteristics, it is intended to offer performance comparable to RDMA on Kunpeng chip architectures.
With proper configuration and tuning, UB Transport is intended to improve the performance of distributed AI workloads, particularly scenarios involving large-scale parameter transfers and distributed training.