Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines

Published in ISCA, 2025

paper

The rapid increase in inter-host networking speed has challenged host processing capabilities, as bursty traffic and uneven load distribution among host CPU cores give rise to excessive queuing delays and high service latency variance. To cost-efficiently tackle this challenge, the latest Intel Xeon Scalable Processor integrates an on-chip accelerator named Dynamic Load Balancer (DLB). It consists of hardware queues, arbiters, and priority-based Quality of Service (QoS) features to maximize load-balancing performance while minimizing host CPU cycle consumption. In this work, we first compare the performance of DLB against popular software-based load balancers with a microbenchmark suite, demonstrating that DLB significantly outperforms them. Yet, DLB still consumes a significant number of host CPU cycles to prepare and enqueue work descriptors for received packets. Second, to eliminate the consumption of host CPU cycles in load balancing, we propose a system architecture/software co-design solution, AccDirect. Specifically, AccDirect leverages the Peer-to-Peer (P2P) communication capability of PCIe devices to establish a direct communication path between the DLB and the NIC, through which the NIC enqueues work descriptors directly. Our evaluation shows that AccDirect-based DLB offers practically the same performance as conventional DLB while reducing the system-wide power consumption of the host CPU by 10%. For an end-to-end application, the throughput of AccDirect-based DLB is 14–50% higher than that of a commodity hardware-based load balancer, with comparable p99 latency. Lastly, after conducting a comprehensive evaluation, we provide guidelines for making the best use of DLB, which exposes a vast configuration space and advanced features.
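To make the "host CPU cycles to prepare and enqueue work descriptors" cost concrete, below is a minimal sketch (not the paper's artifact) of a conventional DLB producer/worker split, assuming DLB is driven through DPDK's eventdev API (e.g., the dlb2 driver). The device, port, and queue identifiers (`EV_DEV`, `PROD_PORT`, `NIC_PORT`, `WORK_QUEUE`) are illustrative placeholders, and all EAL, ethdev, and eventdev setup is assumed to happen elsewhere. The producer loop is the host-side work that AccDirect removes by letting the NIC enqueue descriptors to DLB directly over PCIe P2P.

```c
/* Illustrative sketch only: conventional DLB usage via DPDK eventdev.
 * Device/port/queue IDs below are assumptions, not values from the paper. */
#include <rte_ethdev.h>
#include <rte_eventdev.h>
#include <rte_mbuf.h>

#define EV_DEV     0   /* assumed eventdev id bound to the DLB driver */
#define PROD_PORT  0   /* assumed producer event port */
#define NIC_PORT   0   /* assumed ethdev (NIC) port */
#define WORK_QUEUE 0   /* assumed event queue that workers pull from */
#define BURST      32

/* Producer: receives packets and spends host CPU cycles preparing and
 * enqueuing one work descriptor (rte_event) per packet into DLB. */
static void producer_loop(void)
{
    struct rte_mbuf *pkts[BURST];
    struct rte_event evs[BURST];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(NIC_PORT, 0, pkts, BURST);
        for (uint16_t i = 0; i < n; i++) {
            evs[i] = (struct rte_event){
                .flow_id    = pkts[i]->hash.rss & 0xFFFFF, /* assumes NIC RSS hash */
                .event_type = RTE_EVENT_TYPE_ETHDEV,
                .op         = RTE_EVENT_OP_NEW,
                .sched_type = RTE_SCHED_TYPE_ATOMIC, /* per-flow ordering */
                .queue_id   = WORK_QUEUE,
                .priority   = RTE_EVENT_DEV_PRIORITY_NORMAL,
                .mbuf       = pkts[i],
            };
        }
        /* Descriptor preparation + this enqueue are the host-side cycles
         * that remain with conventional DLB. */
        uint16_t sent = rte_event_enqueue_burst(EV_DEV, PROD_PORT, evs, n);
        for (uint16_t i = sent; i < n; i++)
            rte_pktmbuf_free(pkts[i]); /* drop packets that could not be enqueued */
    }
}

/* Worker: each core dequeues load-balanced events from its DLB port. */
static void worker_loop(uint8_t worker_port)
{
    struct rte_event evs[BURST];

    for (;;) {
        uint16_t n = rte_event_dequeue_burst(EV_DEV, worker_port, evs,
                                             BURST, /*timeout_ticks=*/0);
        for (uint16_t i = 0; i < n; i++) {
            /* ... process evs[i].mbuf ... */
            rte_pktmbuf_free(evs[i].mbuf);
        }
    }
}
```

Under this framing, AccDirect leaves the worker side unchanged while replacing the producer loop's descriptor preparation and enqueue with a NIC-to-DLB P2P path, which is where the reported host CPU power savings come from.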