As part of a new series aimed at sharing best practices in preparing applications for Aurora, we highlight researchers' efforts to optimize codes to run efficiently on graphics processing units.
The impending arrival of graphics processing unit (GPU)-based systems at the Argonne Leadership Computing Facility (ALCF) requires teams of researchers to port their codes, many of which were written exclusively for central processing unit (CPU)-based machines. The ALCF is a U.S. Department of Energy Office of Science User Facility located at Argonne National Laboratory.
Argonne computational scientist Ye Luo and ALCF postdoctoral appointee Pankaj Rajak must prepare QXMD, a Fortran-based scalable quantum molecular dynamics code, in this manner. QXMD emerged from an Aurora Early Science Program project, “Metascalable Layered Materials Genome,” led by Aiichiro Nakano of the University of Southern California and aimed at readying materials science for the delivery of the ALCF’s exascale machine, Aurora. The simulations produced by the code explore nonadiabatic quantum molecular dynamics, which sits at the nexus of physics, chemistry, and materials science; these types of quantum simulations are extremely computationally expensive.
Because no prior GPU version of the code existed, the team had to determine a path to adding performance-portable GPU acceleration to the code without interrupting scientific production.
After assessing the computational profile of the application, the developer team decided to improve the abstraction of the code by reorganizing the most computationally intensive parts into internal modules which could be separately developed and validated. In this vein, the team created a mini-application, Local Field Dynamics (LFD), which computes many-electron dynamics in the framework of real-time time-dependent density functional theory and represents one of the most computationally expensive QXMD kernels.
The mini-app is written in C++, rather than the Fortran used for the greater QXMD code, to make development more efficient, and it uses synthetic test values to exercise real computational routines. Exploiting the separability of the code, the team designed the mini-app as a plugin, allowing it to be configured independently of the greater QXMD. Once the correct mathematical environment had been developed within the mini-app, the researchers were able to direct their focus to porting it to GPU.
As this effort began before the Intel compiler for GPU offload was available, the team took an indirect route to prepare for porting. To enable portability between different platforms, the team developed OpenMP offload capability for LFD using the IBM XL and LLVM Clang compilers on NVIDIA GPUs, analyzing and optimizing the code's computational patterns along the way. With this setup, the team could use their code to validate Intel software in terms of both capability and performance on Intel integrated and discrete GPUs.
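To give a sense of what OpenMP offload looks like in C++, here is a minimal sketch. It is not taken from LFD: the function name `axpy_offload` and the kernel itself (a simple scaled vector addition) are hypothetical stand-ins chosen for illustration. The same directive-based source can be built with any compiler that supports OpenMP target offload, which is the kind of portability the approach relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in kernel (not from LFD): y[i] += a * x[i].
// The target directive asks the compiler to run the loop on an attached
// device; if offload is not enabled or no device is present, the loop
// runs on the host and produces the same result.
void axpy_offload(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* xp = x.data();
    double* yp = y.data();
    const std::size_t n = x.size();

    // map(to:) copies x to the device; map(tofrom:) copies y in and back out.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }
}
```

Calling `axpy_offload(2.0, x, y)` with small synthetic vectors and comparing against values computed by hand is one simple way to check that a given compiler and device produce correct results.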
Once Intel compilers became available, the team began validating them with versions of the code of progressively greater complexity. The team keeps several versions of the code. Earlier ones are simpler, with fewer OpenMP offload features used, and demand less of the compiler, whereas later versions integrate more offload regions and advanced OpenMP features so as to make more challenging demands of the compiler. This allows them to work by solving smaller problems in a piecemeal fashion while also retaining the ability to stress-test the compiler. Meanwhile, the full QXMD program is used to validate the Intel Fortran compiler on CPU systems.
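The idea of keeping simpler and more demanding versions side by side can be sketched as follows. This is an illustrative assumption, not the team's actual code: the function names `scale_steps_simple` and `scale_steps_resident` and the toy computation are hypothetical. The first variant uses a single basic offload construct per step, while the second keeps data resident on the device across several offload regions and adds a reduction, exercising more of the compiler's OpenMP support.

```cpp
#include <cstddef>
#include <vector>

// Simpler variant (hypothetical): one offload region per step, with the
// array mapped to and from the device every time.
void scale_steps_simple(std::vector<double>& v, double factor, int nsteps) {
    double* p = v.data();
    const std::size_t n = v.size();
    for (int s = 0; s < nsteps; ++s) {
        #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
        for (std::size_t i = 0; i < n; ++i) {
            p[i] *= factor;
        }
    }
}

// More demanding variant (hypothetical): a target data region keeps the
// array on the device across all steps, and a final offloaded loop sums
// the elements with a reduction. Returns that sum.
double scale_steps_resident(std::vector<double>& v, double factor, int nsteps) {
    double* p = v.data();
    const std::size_t n = v.size();
    double sum = 0.0;

    #pragma omp target data map(tofrom: p[0:n])
    {
        for (int s = 0; s < nsteps; ++s) {
            // The array is already present on the device, so this map
            // clause does not trigger another transfer.
            #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
            for (std::size_t i = 0; i < n; ++i) {
                p[i] *= factor;
            }
        }

        #pragma omp target teams distribute parallel for \
            map(tofrom: p[0:n]) map(tofrom: sum) reduction(+: sum)
        for (std::size_t i = 0; i < n; ++i) {
            sum += p[i];
        }
    }
    return sum;
}
```

Because both variants compute the same values, a compiler that handles the simpler one but fails on the second points directly at the more advanced features as the source of the problem.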
This Argonne-Intel co-design approach has led to the software stack’s quick maturation to production quality for execution on Aurora, and benefits many developer teams.