Understanding and Predicting Network Performance: It's Hard

Kevin Brown, Argonne National Laboratory
CS Seminar

"The network is the computer" is a phrase that was coined nearly four decades ago and remains as relevant as ever in this era of exascale computing. At the heart of a high-performance computing (HPC) system, or supercomputer, are advanced networking technologies that allow efficient coordination of distributed processing at extreme scales. However, due to the complexities of the network infrastructure and the scientific workloads that use it, ensuring good network performance is hard. Our work aims to provide a better understanding of network behaviors and bottlenecks in order to improve the performance of current supercomputers as well as the designs of future systems. The two core areas of this effort are (i) network performance measurement and analysis, and (ii) network modeling and simulation. This talk will provide an overview of our efforts in these areas, highlighting opportunities for collaboration on topics such as automating application performance analysis, coupling discrete event simulations with machine learning models, and analyzing large system performance datasets.

Kevin A. Brown

Kevin evaluates networking configurations for next-generation supercomputers and improves the network simulators used to evaluate supercomputer designs. In the area of evaluation, he creates new performance analysis tools and techniques to measure and analyze the performance of AI and climate prediction applications on large supercomputers, focusing mainly on network and I/O traffic. In the area of network simulation, he models potential designs for future supercomputers and creates new simulation techniques that use ML-based surrogate models to accelerate parallel discrete event simulations (PDES).

Prior to his fellowship, Kevin received his Ph.D. from the Tokyo Institute of Technology and then served as a postdoctoral appointee in the Argonne Leadership Computing Facility (ALCF).