Researchers are set to use the ALCF's Aurora supercomputer to pursue scientific breakthroughs, thanks to a multiyear, collaborative effort to port and optimize applications for the system.
Aurora, the exascale system at the U.S. Department of Energy’s (DOE’s) Argonne National Laboratory, promises to enable breakthroughs across scientific disciplines.
To support such research diversity from day one, a broad range of applications has been developed for Aurora in parallel with the system’s own development. The insights and lessons learned provided by these code development efforts—supported by the Exascale Computing Project (ECP) and Aurora Early Science Program (ESP)—are expected to enable an even greater array of applications to run on the Intel-Hewlett Packard Enterprise (HPE) supercomputer.
“Ideally, users of, say, our Polaris machine would be able to take their codes, deploy them on Aurora without too much additional effort, and expect reasonable performance,” said Colleen Bertoni, an ALCF computational scientist. “In reality, however, there is a multiplicity of vendors, hardware, and software stacks that significantly complicates inter-system portability.”
The Argonne Leadership Computing Facility (ALCF), the DOE Office of Science user facility that houses Aurora, closely tracked the progress of a set of these first applications throughout their exascale development.
The product of large-scale planning and co-design across the national laboratory system, the exascale systems now being deployed, including Aurora, are ultrapowerful supercomputers capable of performing a billion billion calculations per second. They are revolutionizing the country's scientific computing: their new architectures, based on graphics processing units (GPUs), require codes, frameworks, and libraries built from the ground up.
As expected for a leadership-class supercomputer, initially porting applications to Aurora was challenging. The Intel Data Center GPU Max Series units that supply the vast majority of the system's compute power are the first GPUs of their kind from Intel, and the compilers and libraries, being new and still under development, had not been extensively tested beforehand. On top of the new hardware and software, because Aurora's computational power is an order of magnitude greater than that of previous petascale systems like Polaris, many application developers were also formulating new algorithms and physics to take advantage of it.
The result was a multiyear development effort in which application developers incorporated new exascale-ready physics into their codes, chose programming models that run on Intel GPUs and ported their codes to them, and ensured that the codes run efficiently on those GPUs.
“Given the challenges we faced, application developers were often in a situation where even if we find one workaround for a compiler bug, another bug pops up,” Bertoni said. “Collaborations, hackathons, and workshops enabled teams to work together to examine lines of code—and even assembly—to tune applications and ultimately get the codes working well.”
ECP and ESP application developers collaborated with ALCF and Intel staff via the Aurora Applications Working Group, where code and performance updates were communicated on a quarterly basis. Bugs and other issues were tracked and reported in a common repository.
Tracking progress was vital and visual. “People use these machines to perform complex computations very fast, so it’s not sufficient for applications to merely run—they must run well. To be able to determine if we were on track, progress was literally recorded on a chart in red and different shades of green to represent the codes that were not working and those that were,” said Scott Parker, performance engineering team lead at the ALCF.
“We’re very happy with the number of applications now running well on Aurora,” he continued. “As well as covering a large number of scientific domains—astrophysics, fluid dynamics, materials science—they provide a powerful platform for harnessing Aurora’s AI capabilities.”
Executing code on GPUs requires a programming model that can launch computational kernels onto the devices, so each application team had to choose a programming model capable of running its code on Aurora's Intel GPUs.
Some application codes were originally written for central processing unit (CPU)-only systems, while others had been created with a different vendor's GPUs in mind.
“Some of the codes we brought to Aurora were 20 or 30 years old. But in order to enable breakthrough science, codes often have to evolve to be able to do new things,” Bertoni said, “which means at least portions of the codes need to be rewritten.”
While some decisions on which programming model to use were relatively straightforward, others involved more evaluation and trial and error.
“Codes that had been running on CPU-only systems typically were brought to Aurora with MPI and OpenMP programming models. OpenMP is also generally what’s used for Fortran codes, unless they’re rewritten using C++,” Parker explained. “Codes written in C++ had a lot of options, and there are many models to choose from for portability: OpenMP, SYCL, oneAPI. There are several DOE-based models, including Kokkos and RAJA.”
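For CPU-era codes taking the MPI-plus-OpenMP route, offloading typically means marking up existing loops with target directives. The sketch below is a minimal, generic illustration of OpenMP offload in C++, not taken from any Aurora application; the array names, sizes, and scaling factor are arbitrary.

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    double a = 0.5;

    double *xp = x.data();
    double *yp = y.data();

    // Map the arrays to the device, run the loop on the GPU,
    // and copy y back to the host when the region ends.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] += a * xp[i];
    }

    std::printf("y[0] = %f\n", yp[0]);  // expect 2.5
    return 0;
}
```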
“People have taken many different approaches over the last few years. Some codes were already using Kokkos, so the development teams stuck with it for Aurora,” Bertoni said. “Codes that had been designed for NVIDIA processors usually had CUDA as a model, which bears a lot of similarities to SYCL. In migrating from CUDA to SYCL, there are tools, such as SYCLomatic from Intel, that automate part of the process.”
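To give a sense of the CUDA-to-SYCL similarity Bertoni describes, the sketch below expresses the kind of element-wise vector-add kernel a CUDA code would launch as a __global__ function, written instead in SYCL. It is a generic, hedged example, not SYCLomatic output and not drawn from any Aurora code.

```cpp
#include <sycl/sycl.hpp>
#include <vector>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    sycl::queue q{sycl::gpu_selector_v};  // pick a GPU device

    {
        // Buffers manage host<->device data movement automatically.
        sycl::buffer<float> buf_a(a.data(), sycl::range<1>(n));
        sycl::buffer<float> buf_b(b.data(), sycl::range<1>(n));
        sycl::buffer<float> buf_c(c.data(), sycl::range<1>(n));

        // The lambda plays the role a __global__ kernel plays in CUDA.
        q.submit([&](sycl::handler &h) {
            sycl::accessor A(buf_a, h, sycl::read_only);
            sycl::accessor B(buf_b, h, sycl::read_only);
            sycl::accessor C(buf_c, h, sycl::write_only, sycl::no_init);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    }  // buffers go out of scope here, so c is copied back to the host

    std::printf("c[0] = %f\n", c[0]);  // expect 3.0
    return 0;
}
```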
Scaling on a system as large as Aurora is a persistent challenge.
“What makes a supercomputer ‘super’ is all the nodes—of which Aurora has over 10,000—working together, which yields performance distinct from that of a single rack or server. For this reason, we strongly encourage teams to use at least 20 percent of the system for optimal performance,” Parker said.
This makes scaling a crucial aspect of the development process.
“We have to deal with scaling on every machine, but every machine is different and comes with unique challenges,” said Parker. “We work to scale incrementally, beginning with one node and then, say, 64 nodes and then thousands, as more become available to us.”
Some of the early application codes are already running on Aurora on up to 8,192 nodes.
“This is close to where we want codes to be, so the codes operating at this level can provide something of a path for other developers to follow,” said Thomas Applencourt, an ALCF computational scientist.
Load balancing is one of the biggest problems to solve when trying to achieve good scaling.
“Let’s say you’re running a fluid dynamics solver,” Applencourt said. “You have a wall as your domain, and part of the field you’re simulating intersects that wall. You have to treat the wall portion of the field differently than you do the rest of the field, which can result in either more or less computational work, so the node that’s responsible for the wall portion of the field may run faster or slower than the other nodes. If one node has to do more work than the others, the other nodes have to wait for it to catch up, which eats away at efficiency.”
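One simple way to quantify the imbalance Applencourt describes is to time each rank's compute phase and compare the slowest rank to the average. The MPI sketch below is a generic illustration with a made-up work function standing in for a real solver; a max-to-average ratio near 1.0 indicates evenly distributed work.

```cpp
#include <mpi.h>
#include <cstdio>

// Stand-in for the real per-rank work, e.g. one solver step.
void do_local_work(int rank) {
    volatile double s = 0.0;
    // Ranks handling a "wall" region might loop more (or fewer) times.
    long iters = 50000000L + 10000000L * (rank % 4);
    for (long i = 0; i < iters; ++i) s += 1.0 / (i + 1);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    do_local_work(rank);
    double local = MPI_Wtime() - t0;

    double max_t = 0.0, sum_t = 0.0;
    MPI_Reduce(&local, &max_t, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &sum_t, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double avg_t = sum_t / size;
        // Imbalance factor: 1.0 means perfectly balanced; the slowest
        // rank sets the pace, so every other rank waits for it.
        std::printf("max/avg time = %.2f\n", max_t / avg_t);
    }

    MPI_Finalize();
    return 0;
}
```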
As a result, ALCF staff and developers have been working hard to achieve even distribution of work across the Aurora system.
“Strong scaling versus weak scaling is another big challenge. If you want to run bigger and bigger problems, that’s relatively easy in that you have the machine do more work as you get more nodes,” ALCF computational scientist JaeHyuk Kwack said. “But if you want to run a fixed size problem—say, simulating the weather on the earth—that can be trickier. The earth doesn’t get bigger, it’s a fixed size, so you want to perform the simulation faster and faster, so you bring in more nodes. But the amount of work per node decreases, and these inefficiencies creep in—GPU offload has some overhead, the communication has some overhead. Those inefficiencies start to eat away at overall efficiency, so that’s one of the areas in which performance optimization can require a lot of effort.”
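Kwack's point can be put in simple numbers: for a fixed-size problem, the ideal runtime on N nodes is the single-node runtime divided by N, and per-node overheads pull the achieved result below that ideal. The short sketch below uses hypothetical timings to show how strong-scaling efficiency is computed and how it decays as the work per node shrinks.

```cpp
#include <cstdio>

int main() {
    // Hypothetical wall-clock times (seconds) for one fixed-size problem.
    const int    nodes[]  = {1, 64, 1024, 8192};
    const double time_s[] = {10000.0, 170.0, 13.5, 2.4};

    double t1 = time_s[0];
    for (int i = 0; i < 4; ++i) {
        double speedup = t1 / time_s[i];
        // Efficiency = speedup / node count; offload and communication
        // overheads pull it below 1.0 as the work per node decreases.
        double efficiency = speedup / nodes[i];
        std::printf("%5d nodes: speedup %8.1f, efficiency %.2f\n",
                    nodes[i], speedup, efficiency);
    }
    return 0;
}
```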
The ALCF estimates that as many as 200 different applications will run on Aurora over its lifespan, and thanks to the efforts of the early development teams, the process of migrating subsequent codes should be comparatively easy.
“We better understand the tradeoffs involved with different models and approaches, which makes it easier for users to choose a path, and our toolchain has improved dramatically,” Bertoni said. “New users bringing over new applications aren’t expected to face as many challenges.”
Applications now running on Aurora include AMR-Wind, Flash-X, GAMESS, CRK-HACC, LAMMPS, NWChemEx, OpenMC, SW4, and XGC.
“More than anything, we’re excited for this system to start enabling scientific breakthroughs and accelerating discovery,” Parker said.