PD Disaggregation Performance#

We evaluated the current implementation on two A10 servers. By comparing the performance of a 1P1D configuration with that of two regular (non-disaggregated) instances, we observed that P/D disaggregation achieves approximately 30% lower ITL while maintaining comparable total throughput. This aligns with findings from the Mooncake paper, which highlighted that P/D disaggregation is effective in reducing TBT/ITL under similar throughput conditions—or conversely, in enabling higher throughput under stricter ITL/TBT SLOs. Moreover, we anticipate even greater benefits in larger-scale clusters where both the number of prefill and decode nodes (x and y in xPyD configurations) increase, offering enhanced scheduling flexibility and resource efficiency.

Traffic Request Rate: 1.0#

  • model: Qwen2.5-7B-Instruct-GPTQ-Int4

  • TP: 4

  • random_input_len=8192, random_output_len=512

  • num prompt=50

Configuration

Output Token Throughput (tok/s)

Mean E2E Latency (ms)

Total Token Throughput (tok/s)

Mean TTFT (ms)

P99 TTFT (ms)

Mean ITL (ms)

P99 ITL (ms)

1P1D

407.59

3413.86

7084.46

732.54

2952.57

7.23

10.76

2 Regular

427.65

4586.54

7433.27

767.18

1264.88

10.30

12.73

Traffic Request Rate: 4.0#

  • model: Qwen2.5-7B-Instruct-GPTQ-Int4

  • TP: 2

  • random_input_len=2048, random_output_len=512

  • num prompt=200

Configuration

Output Token Throughput (tok/s)

Mean E2E Latency (ms)

Total Token Throughput (tok/s)

Mean TTFT (ms)

P99 TTFT (ms)

Mean ITL (ms)

P99 ITL (ms)

1P1D

1215.17

11519.24

6161.43

1111.94

2725.89

17.06

19.72

2 Regular

1223.03

11683.15

6201.29

310.01

720.91

25.74

294.89