Optimizing PETSc for exascale science

Best Practices for GPU Code Development

An overview of the efforts to prepare the PETSc library for deployment on Aurora.

Junchao Zhang, a software engineer at the U.S. Department of Energy’s (DOE) Argonne National Laboratory, is leading a team of researchers working to prepare PETSc (Portable, Extensible Toolkit for Scientific Computation) for the nation’s exascale supercomputers—including Aurora, the exascale system set for deployment at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility located at Argonne.

Best practices

 

  • Use the Kokkos math library as a wrapper instead of writing multiple interfaces for different vendor libraries.

PDE library used in numerous fields

PETSc is a math library for the scalable solution of models based on continuous partial differential equations (PDEs). PDEs, fundamental for describing the natural world, are ubiquitous in science and engineering. As such, PETSc is used across numerous disciplines and industry sectors, including aerodynamics, neuroscience, computational fluid dynamics, seismology, fusion, materials science, ocean dynamics, and the oil industry.

Lessons learned

 

  • Avoid duplicating code.

 

  • Isolate CPU-GPU data synchronizations, as out-of-sync data bugs are hard to notice and debug.

 

  • Profile often and visualize the GPU timeline to identify and eliminate hidden and unexpected activities.

As researchers in both science and industry seek to run increasingly high-fidelity simulations on increasingly large problems, PETSc stands to benefit directly from exascale computing power. In addition, technology developed for exascale can be applied to less powerful systems, making PETSc applications on those systems faster and cheaper and, in turn, broadening adoption.

Furthermore, each of the exascale machines scheduled to come online at DOE facilities has adopted an accelerator-based architecture and derives the majority of its compute power from graphics processing units (GPUs). This has made porting PETSc for efficient use on GPUs an absolute necessity.

However, each vendor of exascale computing systems has adopted its own programming model and corresponding ecosystem. Moreover, portability between the different models, where it exists at all, remains in its infancy for practical purposes.

To avoid being locked into a particular vendor's programming model, and to take advantage of Kokkos's extensive user support and math library, Zhang's team opted to prepare PETSc for GPUs by using the vendor-independent Kokkos as its portability layer and primary backend wherever possible, otherwise relying directly on CUDA, SYCL, and HIP.

Instead of writing multiple interfaces for different vendor libraries, the researchers employ the Kokkos math library, known as Kokkos-Kernels, as a wrapper. Because Kokkos is itself a library, it also lets the team accommodate their users' own choice of programming model, enabling seamless and natural GPU support.
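To make the approach concrete, the following minimal sketch (assembled for this article, not taken from PETSc's source) shows what a thin dot-product wrapper over Kokkos-Kernels can look like. The function name VecDot_KokkosSketch is hypothetical; the only library facilities assumed are Kokkos views and the KokkosBlas::dot routine.

```cpp
// Minimal sketch of the wrapper idea: one thin interface built on
// Kokkos-Kernels replaces separate cuBLAS/rocBLAS/oneMKL code paths,
// because Kokkos-Kernels dispatches to whichever backend Kokkos was
// built for (CUDA, HIP, SYCL, or host).
#include <cstdio>
#include <Kokkos_Core.hpp>
#include <KokkosBlas1_dot.hpp>

using DeviceVec = Kokkos::View<double*>;  // allocated in the default (device) memory space

// Hypothetical PETSc-style vector dot product backed by Kokkos-Kernels.
double VecDot_KokkosSketch(const DeviceVec& x, const DeviceVec& y) {
  return KokkosBlas::dot(x, y);  // single call, single source
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    DeviceVec x("x", 1000), y("y", 1000);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);
    std::printf("dot = %g\n", VecDot_KokkosSketch(x, y));  // expect 2000
  }
  Kokkos::finalize();
  return 0;
}
```

The same source file compiles against whichever backend Kokkos was configured for, which is what removes the need for separate vendor-specific code paths in the library itself.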

Expanding GPU support

Prior to the efforts of Zhang's team, which are sponsored by DOE's Exascale Computing Project (ECP), PETSc's GPU support was limited to NVIDIA processors and required many of its compute kernels to execute on the host. This limited both the code's portability and its capability.

“So far, we think adopting Kokkos is successful, as we only need a single source code,” Zhang said. “We had direct support for NVIDIA GPUs with CUDA. We tried to duplicate the code to directly support AMD GPUs with HIP. We find it is painful to maintain duplicated code: the same feature needs to be implemented at multiple places, and the same bug needs to be fixed at multiple places. Once CUDA and HIP application programming interfaces (APIs) diverge, it becomes even more difficult to duplicate a code.”
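As a toy illustration of the duplication problem Zhang describes (a sketch written for this article, not code from PETSc), the AXPY kernel below is written once with Kokkos; in a raw CUDA-plus-HIP setup, the same kernel body would have to be written and maintained twice.

```cpp
// One kernel, one source file: the same code targets CUDA, HIP, SYCL,
// or plain host threads depending on how Kokkos was configured.
#include <Kokkos_Core.hpp>

void axpy(double alpha, Kokkos::View<const double*> x, Kokkos::View<double*> y) {
  const int n = static_cast<int>(y.extent(0));
  Kokkos::parallel_for(
      "axpy", n, KOKKOS_LAMBDA(const int i) { y(i) += alpha * x(i); });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double*> x("x", 1 << 20), y("y", 1 << 20);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);
    axpy(3.0, x, y);   // afterwards y(i) == 5.0 for all i
    Kokkos::fence();   // wait for the kernel to finish
  }
  Kokkos::finalize();
  return 0;
}
```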

However, while PETSc is written in C, enough GPU programming models use C++ that Zhang’s team has found it necessary to add an increasing number of C++ files.

“Within the ECP project, bearing in mind a formula in computing architecture known as Amdahl’s law, which suggests that any single unaccelerated portion of the code could become a bottleneck to overall speedup,” Zhang explained, “we tried to consider the GPU-porting job and the portability of the GPU code in holistic terms.”
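Amdahl's law, as invoked here, can be stated compactly: if a fraction p of the runtime is accelerated by a factor s and the remaining fraction 1 - p is left on the CPU, the overall speedup is capped regardless of how large s becomes.

```latex
% Amdahl's law: overall speedup S when a fraction p of the work is
% accelerated by a factor s and the rest is left unaccelerated.
\[
  S(s) = \frac{1}{(1 - p) + \frac{p}{s}},
  \qquad
  \lim_{s \to \infty} S(s) = \frac{1}{1 - p}.
\]
```

For example, if 95 percent of the work is moved to the GPU (p = 0.95), the overall speedup can never exceed 1/0.05 = 20x, however fast the GPU kernels are, which is why the team approached the porting and portability work holistically rather than accelerating isolated hot spots.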

Optimizing communication and computation

The team is working to optimize GPU functionality on two fronts: communication and computation.

As the team discovered, CPU-GPU data synchronizations must be carefully isolated to avoid the subtle, hard-to-find bugs they can cause.

Therefore, to improve communication, the researchers added support for GPU-aware MPI (Message Passing Interface), which allows data to be passed directly between GPUs instead of being staged in CPU buffers. Moreover, to remove GPU synchronizations that result from current MPI constraints on asynchronous computation, the team investigated GPU-stream-aware communication that bypasses MPI altogether and passes data using the NVIDIA NVSHMEM library. The team is also collaborating with Argonne's MPICH group to test new extensions that address the MPI constraints, as well as a stream-aware MPI feature developed by the group.
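The sketch below (written for this article; it is not PETSc code) isolates the GPU-aware MPI idea: device-resident Kokkos views are handed directly to MPI_Isend and MPI_Irecv, so no host staging buffer is involved. It assumes an MPI installation built with GPU support; the ring-exchange pattern, buffer names, and sizes are purely illustrative.

```cpp
// GPU-aware MPI sketch: pass device pointers straight to MPI.
// Requires an MPI build with GPU support.
#include <mpi.h>
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  Kokkos::initialize(argc, argv);
  {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    // Views allocated in the default (device) memory space.
    Kokkos::View<double*> sendbuf("sendbuf", n), recvbuf("recvbuf", n);
    Kokkos::deep_copy(sendbuf, static_cast<double>(rank));
    Kokkos::fence();  // make sure the fill has completed before communicating

    // Ring exchange, handing device pointers to MPI with no host copy.
    const int next = (rank + 1) % size, prev = (rank - 1 + size) % size;
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf.data(), n, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf.data(), n, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // A stream-aware path (e.g., NVSHMEM or the stream-aware MPI extensions
    // mentioned above) would let this exchange be enqueued on the GPU stream,
    // avoiding the host-side synchronization implied by MPI_Waitall.
  }
  Kokkos::finalize();
  MPI_Finalize();
  return 0;
}
```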

For optimized GPU computation, Zhang's team ported a number of functions to the device in order to reduce back-and-forth copying of data between host and device. For example, matrix assembly, an operation essential to PETSc use, was previously carried out on the host because its CPU-friendly APIs could not feasibly be parallelized on GPUs. The team added new matrix assembly APIs suitable for GPUs, improving performance.
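The article does not name the new assembly interfaces. The sketch below assumes the coordinate-format (COO) routines MatSetPreallocationCOO and MatSetValuesCOO from recent PETSc releases, which hand the full nonzero pattern and then all values to the library in bulk rather than inserting entries row by row; the tiny 2x2 matrix and the option named in the comments are illustrative only, and the example is meant for a single MPI rank.

```cpp
// GPU-friendly, coordinate (COO) style matrix assembly with PETSc.
// Bulk pattern-then-values assembly maps to the GPU far better than
// fine-grained, row-by-row insertion with MatSetValues.
#include <petscmat.h>

int main(int argc, char** argv) {
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  // A tiny example: nonzeros (0,0)=1, (0,1)=2, (1,1)=3 of a 2x2 matrix.
  PetscInt    coo_i[] = {0, 0, 1};
  PetscInt    coo_j[] = {0, 1, 1};
  PetscScalar coo_v[] = {1.0, 2.0, 3.0};

  Mat A;
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 2, 2));
  PetscCall(MatSetFromOptions(A));  // e.g., run with -mat_type aijkokkos for a GPU matrix

  // Hand all coordinates to PETSc once; the nonzero pattern is now fixed.
  PetscCall(MatSetPreallocationCOO(A, 3, coo_i, coo_j));
  // Provide all values in a single call; for GPU matrix types the values
  // array may reside in device memory, avoiding a host staging copy.
  PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES));

  PetscCall(MatView(A, PETSC_VIEWER_STDOUT_WORLD));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}
```

Handing the pattern and values over in bulk is what makes the operation data-parallel: there is no per-entry host-side bookkeeping left to serialize, so the assembly itself can run on the device.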

Improving code development

Aside from recognizing the importance of avoiding code duplication and of encapsulating and isolating CPU-GPU data synchronizations, the team has learned to profile often (relying on tools such as NVIDIA's nvprof and Nsight Systems) and to inspect the timeline of GPU activities in order to identify, and subsequently eliminate, hidden and unexpected activities.

One crucial difference between the Intel Xe GPUs that will power Aurora and the GPUs in other exascale machines is that the Xe GPUs contain multiple subslices, meaning that optimal performance hinges on NUMA-aware programming. (NUMA, or non-uniform memory access, is a memory design in which a processor accesses its own local memory faster than memory attached to other processors.)

Reliance on a single source code enables PETSc to run readily on Intel, AMD, and NVIDIA GPUs, albeit with certain tradeoffs. By making Kokkos an intermediary between PETSc and the vendors, PETSc becomes dependent on the quality of Kokkos: the Kokkos-Kernels APIs must be well optimized against the vendor libraries, or performance suffers. When the researchers find key Kokkos-Kernels functions that are not optimized for vendor libraries, they contribute fixes upstream as the issues arise.

As part of the project’s next steps, the researchers will help the Kokkos-Kernels team add interfaces to the Intel oneMKL math kernel library before testing them with PETSc. This, in turn, will aid the Intel oneMKL team as they prepare the library for Aurora.

Zhang noted that to further expand PETSc’s GPU capabilities, his team will work to support more low-level data structures in PETSc along with more high-level user-facing GPU interfaces. The researchers also intend to work with users to help ensure efficient use of PETSc on Aurora.

==========

The Best Practices for GPU Code Development series highlights researchers’ efforts to optimize codes to run efficiently on the ALCF’s Aurora exascale supercomputer.

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation's first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America's scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy's Office of Science.

The U.S. Department of Energy's Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science