PD Disaggregation Performance
We evaluated the current implementation on two A10 servers, comparing a 1P1D configuration (one prefill instance and one decode instance) against two regular (non-disaggregated) instances. P/D disaggregation achieved approximately 30% lower mean ITL while maintaining comparable total throughput. This is consistent with the Mooncake paper, which found that P/D disaggregation reduces TBT/ITL at similar throughput or, conversely, allows higher throughput under stricter ITL/TBT SLOs. We expect larger benefits in larger clusters, where increasing the number of prefill and decode nodes (x and y in an xPyD configuration) offers more scheduling flexibility and better resource efficiency.
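The "approximately 30% lower ITL" and "comparable total throughput" figures can be checked directly against the mean ITL and total token throughput values reported in the tables in this section. The following is a minimal sketch of that arithmetic; the dictionary layout and the names `RESULTS`, `relative_change`, and `summarize` are illustrative choices, not part of any benchmark tool.

```python
# Figures copied from the benchmark tables in this section.
# Outer keys are request rates; inner keys are configurations.
RESULTS = {
    1.0: {
        "1P1D": {"mean_itl_ms": 7.23, "total_tok_s": 7084.46},
        "2 Regular": {"mean_itl_ms": 10.30, "total_tok_s": 7433.27},
    },
    4.0: {
        "1P1D": {"mean_itl_ms": 17.06, "total_tok_s": 6161.43},
        "2 Regular": {"mean_itl_ms": 25.74, "total_tok_s": 6201.29},
    },
}


def relative_change(new: float, baseline: float) -> float:
    """Return (new - baseline) / baseline as a percentage."""
    return (new - baseline) / baseline * 100.0


def summarize(results=RESULTS):
    """Compare 1P1D against the 2 Regular baseline at each request rate."""
    summary = {}
    for rate, configs in results.items():
        pd, base = configs["1P1D"], configs["2 Regular"]
        summary[rate] = {
            key: round(relative_change(pd[key], base[key]), 1)
            for key in pd
        }
    return summary


if __name__ == "__main__":
    for rate, changes in summarize().items():
        print(f"request rate {rate}: {changes}")
```

With these inputs, mean ITL for 1P1D is about 29.8% lower at request rate 1.0 and about 33.7% lower at request rate 4.0, while total token throughput is about 4.7% and 0.6% lower, respectively.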
Traffic Request Rate: 1.0
model: Qwen2.5-7B-Instruct-GPTQ-Int4
TP: 4
random_input_len=8192, random_output_len=512
num_prompts=50
Configuration | Output Token Throughput (tok/s) | Mean E2E Latency (ms) | Total Token Throughput (tok/s) | Mean TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|
1P1D | 407.59 | 3413.86 | 7084.46 | 732.54 | 2952.57 | 7.23 | 10.76 |
2 Regular | 427.65 | 4586.54 | 7433.27 | 767.18 | 1264.88 | 10.30 | 12.73 |
Traffic Request Rate: 4.0
model: Qwen2.5-7B-Instruct-GPTQ-Int4
TP: 2
random_input_len=2048, random_output_len=512
num_prompts=200
Configuration | Output Token Throughput (tok/s) | Mean E2E Latency (ms) | Total Token Throughput (tok/s) | Mean TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | P99 ITL (ms) |
---|---|---|---|---|---|---|---|
1P1D | 1215.17 | 11519.24 | 6161.43 | 1111.94 | 2725.89 | 17.06 | 19.72 |
2 Regular | 1223.03 | 11683.15 | 6201.29 | 310.01 | 720.91 | 25.74 | 294.89 |
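To illustrate the point about stricter ITL SLOs, the sketch below checks which configurations keep P99 ITL within a latency budget at each request rate. The P99 ITL values are taken from the two tables above; the 20 ms threshold (`ITL_SLO_MS`) and the helper `configs_meeting_slo` are hypothetical, chosen only for illustration and not part of the benchmark.

```python
# P99 ITL (ms) copied from the benchmark tables above.
# Outer keys are request rates; inner keys are configurations.
P99_ITL_MS = {
    1.0: {"1P1D": 10.76, "2 Regular": 12.73},
    4.0: {"1P1D": 19.72, "2 Regular": 294.89},
}

# Hypothetical SLO for illustration only; not a value from the benchmark.
ITL_SLO_MS = 20.0


def configs_meeting_slo(p99_itl, slo_ms):
    """For each request rate, list configurations whose P99 ITL is within the SLO."""
    return {
        rate: [name for name, itl in configs.items() if itl <= slo_ms]
        for rate, configs in p99_itl.items()
    }


print(configs_meeting_slo(P99_ITL_MS, ITL_SLO_MS))
# → {1.0: ['1P1D', '2 Regular'], 4.0: ['1P1D']}
```

Under this assumed budget, both configurations qualify at request rate 1.0, but only 1P1D does at request rate 4.0, where the two regular instances show a P99 ITL of 294.89 ms. The same table also shows higher TTFT for 1P1D at that rate, so an SLO that bounds TTFT as well would need a separate check.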