PD Disaggregation Performance


We evaluated the current implementation on two A10 servers, comparing a 1P1D configuration (one prefill node and one decode node) against two regular (non-disaggregated) instances. P/D disaggregation lowered mean inter-token latency (ITL) by roughly 30% (29.8% at a request rate of 1.0 and 33.7% at 4.0), while total token throughput stayed within about 5% of the baseline. This is consistent with the Mooncake paper, which found that P/D disaggregation reduces time between tokens (TBT, i.e. ITL) at similar throughput or, conversely, allows higher throughput under a stricter TBT/ITL SLO. The gain comes with a TTFT cost in these runs: P99 TTFT is higher for 1P1D at both request rates, and at a request rate of 4.0 mean TTFT is also higher (1111.94 ms versus 310.01 ms). We expect larger benefits in bigger clusters, where increasing the number of prefill and decode nodes (x and y in an xPyD configuration) gives the scheduler more flexibility and improves resource efficiency.
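The roughly 30% ITL figure can be checked directly against the tables on this page. The snippet below is only illustrative arithmetic, not part of any benchmark tooling; the names `RESULTS`, `percent_change`, and `summarize` are hypothetical, and the numbers are copied from the 1P1D and 2 Regular rows.

```python
# Mean ITL (ms) and total token throughput (tok/s), copied from the
# result tables on this page. The dictionary layout is an assumption
# made for this example only.
RESULTS = {
    1.0: {
        "1P1D": {"mean_itl_ms": 7.23, "total_tok_s": 7084.46},
        "2 Regular": {"mean_itl_ms": 10.30, "total_tok_s": 7433.27},
    },
    4.0: {
        "1P1D": {"mean_itl_ms": 17.06, "total_tok_s": 6161.43},
        "2 Regular": {"mean_itl_ms": 25.74, "total_tok_s": 6201.29},
    },
}


def percent_change(new: float, baseline: float) -> float:
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def summarize(results=RESULTS):
    """Compare 1P1D against the 2 Regular baseline at each request rate."""
    summary = {}
    for rate, configs in results.items():
        pd, regular = configs["1P1D"], configs["2 Regular"]
        summary[rate] = {
            "itl_change_pct": round(
                percent_change(pd["mean_itl_ms"], regular["mean_itl_ms"]), 1
            ),
            "throughput_change_pct": round(
                percent_change(pd["total_tok_s"], regular["total_tok_s"]), 1
            ),
        }
    return summary


if __name__ == "__main__":
    for rate, row in summarize().items():
        print(
            f"rate={rate}: mean ITL {row['itl_change_pct']:+.1f}%, "
            f"total throughput {row['throughput_change_pct']:+.1f}%"
        )
```

Negative values mean 1P1D is lower than the baseline, so a negative ITL change is an improvement and a negative throughput change is a loss.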

Traffic Request Rate: 1.0

model: Qwen2.5-7B-Instruct-GPTQ-Int4
TP: 4
random_input_len=8192, random_output_len=512
num_prompts=50

Configuration	Output Token Throughput (tok/s)	Mean E2E Latency (ms)	Total Token Throughput (tok/s)	Mean TTFT (ms)	P99 TTFT (ms)	Mean ITL (ms)	P99 ITL (ms)
1P1D	407.59	3413.86	7084.46	732.54	2952.57	7.23	10.76
2 Regular	427.65	4586.54	7433.27	767.18	1264.88	10.30	12.73

Traffic Request Rate: 4.0

model: Qwen2.5-7B-Instruct-GPTQ-Int4
TP: 2
random_input_len=2048, random_output_len=512
num_prompts=200

Configuration	Output Token Throughput (tok/s)	Mean E2E Latency (ms)	Total Token Throughput (tok/s)	Mean TTFT (ms)	P99 TTFT (ms)	Mean ITL (ms)	P99 ITL (ms)
1P1D	1215.17	11519.24	6161.43	1111.94	2725.89	17.06	19.72
2 Regular	1223.03	11683.15	6201.29	310.01	720.91	25.74	294.89
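One way to read the two tables together is to ask which configurations would satisfy a given latency SLO. The sketch below does that for the P99 values reported above. The SLO thresholds (20 ms for ITL, 1500 ms for TTFT) are hypothetical values chosen for illustration, not targets from the benchmark, and `P99`, `meets_slo`, and `configs_meeting` are made-up names.

```python
# P99 latencies in milliseconds, copied from the two tables on this page.
# Keys are (request rate, configuration).
P99 = {
    (1.0, "1P1D"): {"ttft": 2952.57, "itl": 10.76},
    (1.0, "2 Regular"): {"ttft": 1264.88, "itl": 12.73},
    (4.0, "1P1D"): {"ttft": 2725.89, "itl": 19.72},
    (4.0, "2 Regular"): {"ttft": 720.91, "itl": 294.89},
}


def meets_slo(measured: dict, slo: dict) -> bool:
    """True if every metric named in `slo` is at or under its limit."""
    return all(measured[metric] <= limit for metric, limit in slo.items())


def configs_meeting(slo: dict) -> list:
    """Return the (rate, configuration) pairs whose P99 values satisfy `slo`."""
    return sorted(key for key, measured in P99.items() if meets_slo(measured, slo))


if __name__ == "__main__":
    # Hypothetical SLOs, for illustration only.
    itl_only = {"itl": 20.0}
    itl_and_ttft = {"itl": 20.0, "ttft": 1500.0}
    print("P99 ITL <= 20 ms:", configs_meeting(itl_only))
    print("P99 ITL <= 20 ms and P99 TTFT <= 1500 ms:", configs_meeting(itl_and_ttft))
```

Under the assumed ITL-only limit, 1P1D passes at both request rates while 2 Regular fails at 4.0 because of its 294.89 ms P99 ITL. Adding the assumed TTFT limit excludes both 1P1D runs, which shows that the comparison depends on which latency metric the SLO constrains.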
