Preparing a legacy event-generator code for GPU

As part of a series aimed at sharing best practices in preparing applications for Aurora, we highlight researchers' efforts to optimize codes to run efficiently on GPU-based systems.

As part of an ongoing Aurora Early Science Program (ESP) project to prepare the ATLAS experiment at CERN’s Large Hadron Collider (LHC) for the exascale era of computing—“Simulating and Learning in the ATLAS Detector at the Exascale,” led by Walter Hopkins—researchers are porting and optimizing the codes that will enable the experiment to run its simulation and data analysis tasks on an array of next-generation architectures. Among them is the soon-to-launch Intel-HPE Aurora system housed at the Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory.

One such code is MadGraph, a particle interaction simulator for LHC experiments that performs the particle-physics calculations needed to generate the particle interactions expected in LHC detectors. As a framework, MadGraph aims to cover complete Standard Model and Beyond the Standard Model phenomenology, including elements such as cross-section computations as well as event manipulation and analysis.

Nathan Nichols, a postdoctoral appointee at the ALCF, joined the ATLAS ESP project to study performance portability frameworks in the context of the MadGraph code, with the goal of enabling experiments like ATLAS to meet their large computational needs on modern supercomputers.

Best practices

  • Ensure the code base is easily maintainable.
  • Comparative analysis of different versions via performance tools, where feasible, offers the most precise way to evaluate relative performance.
  • Continuous integration pipelines can simplify performance comparisons.

Lessons learned

  • Minor negative impacts that result from offloading certain operations might not be worth the effort needed to correct them.

Defining performance portability for MadGraph

“To me, to be portable and performant means that the application needs to be capable of running efficiently and effectively on as many devices as possible—irrespective of vendor, whether it’s NVIDIA or Intel or AMD,” Nichols said. “If an application runs really well on Intel GPUs but has problems running on NVIDIA GPUs, it’s not really performance portable. As a developer you also want the code to be easily maintainable, so you don’t want to have a patchwork of different chunks of code as your code base—you want to write one code and have that code be performant on all different devices.”

Because MadGraph, originally developed more than a decade ago, has a legacy code base that enables users to simulate most physical processes of interest at the LHC, achieving portability was less straightforward than it would be for a traditional standalone scientific application.

“The application contains Python scripts that write Fortran code to generate a simulation of whatever physics experiment is running at CERN at the time; it’s very generic and writes code on the fly—potentially a new set of source code for each physics process, depending on the experiment,” Nichols said. “It could be challenging to port the application performantly, or to write a performant GPU kernel because the kernel could be any sort of particle configuration, and we need to be able to generate and run on devices effectively and efficiently whatever physics process that physicists might want to explore. That was somewhat daunting to tackle.”

Determining which portability framework to adopt

The team had already settled on testing three portability frameworks, SYCL, Kokkos, and alpaka, when Nichols joined the project. He would take the lead on developing the SYCL version.

Five representative physics processes were chosen as standard cases to test code performance.

When it came time to work in earnest on the SYCL port of MadGraph, Nichols’s first step was to examine the native CUDA code to identify areas in which performance gains could be made.

“After everything was written and we had working versions of the software for different portability frameworks, we needed to narrow down our options to determine which made the most sense to support in the future,” he explained.

Nichols led the Argonne effort to measure and compare the performance of the frameworks across multiple architectures.

Using GitLab, Nichols set up continuous integration pipelines to carry out regular performance tests on the various devices hosted at Argonne’s Joint Laboratory for System Evaluation (JLSE). The systems on which these nightly performance tests ran included NVIDIA GPUs (V100 and A100), Intel GPUs (early versions of those in Aurora), Intel CPUs (Skylake), and AMD GPUs (MI50, MI100, MI250).

The testing setup afforded by the JLSE systems made it straightforward to judge the frameworks’ performance across the five physics processes under evaluation: Nichols would begin with a computationally simple physics process before progressively ramping up the level of computational difficulty. Throughout the testing, Nichols made slight alterations to the software stack to evaluate how performance was affected.
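
As a rough illustration of what a single nightly measurement boils down to, the sketch below times one SYCL kernel with a host-side clock. The kernel body, problem size, and device selection are placeholder assumptions rather than actual MadGraph code.

    // Hypothetical timing sketch; the kernel is a stand-in for a matrix element calculation.
    #include <sycl/sycl.hpp>
    #include <chrono>
    #include <iostream>

    int main() {
      constexpr size_t N = 1 << 20;                  // number of events (illustrative)
      sycl::queue q;                                 // default device on the test node
      float *out = sycl::malloc_device<float>(N, q);

      auto t0 = std::chrono::steady_clock::now();
      q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        out[i] = static_cast<float>(i[0]) * 2.0f;    // placeholder computation
      }).wait();                                     // wait so the host clock covers the full kernel
      auto t1 = std::chrono::steady_clock::now();

      std::cout << "kernel time: "
                << std::chrono::duration<double, std::milli>(t1 - t0).count()
                << " ms\n";
      sycl::free(out, q);
      return 0;
    }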

He conducted performance-scaling studies to see which portability framework delivered the best performance across the different GPUs. The SYCL port was eventually determined to be the most performant on all tested systems, outperforming even the native CUDA and CPU codes, with Kokkos a close second. Given these metrics, the ATLAS team chose to move forward with SYCL and discontinue development with the other portability frameworks.

With the SYCL port selected, Nichols updated the Python-based code generator in the MadGraph framework to optionally output the SYCL matrix element calculations; originally, the generator produced only the Fortran-based matrix element code for the user-specified physics process.

While Nichols worked on the SYCL port, the CERN team had developed a code-mixing bridge that allows the Fortran code to call functions in the multiple C++ portability libraries being tested.

“We needed the SYCL library and the Fortran code to talk to each other,” Nichols said. “But we couldn’t get the SYCL library and Fortran code to link properly, on account of the different programming languages in play, which was the first of many bugs we discovered. Luckily my team had been working on the problem closely with contacts at Intel, and now all of those bugs are being taken care of and we should be able to smooth them out.”
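
The bridge itself is CERN-developed code, but the general pattern it relies on can be sketched as a C-linkage wrapper around the SYCL library that Fortran reaches through ISO_C_BINDING. The function name, arguments, and kernel body below are illustrative assumptions, not the actual MadGraph interface.

    // Minimal sketch of a Fortran-callable entry point into a SYCL/C++ library.
    #include <sycl/sycl.hpp>

    extern "C" void sycl_compute_matrix_elements(const double *momenta,
                                                 double *matrix_elements,
                                                 int n_events) {
      static sycl::queue q;                              // reuse one queue across calls

      // Wrap the Fortran-owned arrays in buffers, run the kernel, write back on exit.
      sycl::buffer<double, 1> in(momenta, sycl::range<1>(n_events));
      sycl::buffer<double, 1> out(matrix_elements, sycl::range<1>(n_events));

      q.submit([&](sycl::handler &h) {
        sycl::accessor pin(in, h, sycl::read_only);
        sycl::accessor pout(out, h, sycl::write_only, sycl::no_init);
        h.parallel_for(sycl::range<1>(n_events), [=](sycl::id<1> i) {
          pout[i] = pin[i] * pin[i];                     // placeholder for the real calculation
        });
      });                                                // buffer destruction syncs results to the host
    }
    // On the Fortran side this would be declared with BIND(C) in an INTERFACE block.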

Dealing with high register pressure

Some of those bugs stem from issues that the MadGraph code has with high register pressure. A register file is a small block of fast, on-chip storage in which a kernel’s working values are held, and how it is used has a critical impact on GPU performance. Register pressure, loosely speaking, is the term for when demand on the register file approaches its maximum capacity.

“Once the register file is full, the objects inside have to be transferred to a different memory location, which takes time, and then the emptied register file begins filling with new items,” Nichols said. “Now an application I’m running has to call items not just from the ordinary register file, but from throughout the system’s global memory.”

Apart from correcting dips in performance that resulted from transfers triggered by register pressure, the MadGraph team has had to debug the code in order to take advantage of available performance profiling tools, which themselves require register-file reservation.

“Since we’re having these register spills, the performance tools can’t reserve that space themselves, so we can’t get really in-depth analysis of what our code is doing. This in turn means that we have to rely on educated guesses informed by prior programming experience,” Nichols said.

Toward deployment on Aurora

The SYCL port of MadGraph has displayed superior performance on Intel GPUs compared to other versions of the application to date.

After running the SYCL port on the entirety of the Aurora test and development system Sunspot—which has 128 available nodes—Nichols and the ATLAS team began tuning I/O and communication for MadGraph-generated files to ensure efficient functionality when deployed at scale on Aurora.

MadGraph is an embarrassingly parallel application, capable of assigning an independent process to each GPU and scaling linearly.

“Because MadGraph is embarrassingly parallel, scaling is not a worry as far as Aurora deployment goes,” Nichols pointed out.
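
A minimal sketch of that embarrassingly parallel pattern, assuming one SYCL queue per GPU on a node; the run_batch helper and batch size are illustrative, and in production each batch would typically be driven by its own process or MPI rank rather than a loop on one host.

    #include <sycl/sycl.hpp>
    #include <vector>
    #include <cstddef>

    // One independent batch of work submitted to a single device (placeholder kernel).
    void run_batch(sycl::queue &q, std::size_t n_events) {
      float *out = sycl::malloc_device<float>(n_events, q);
      q.parallel_for(sycl::range<1>(n_events), [=](sycl::id<1> i) {
        out[i] = static_cast<float>(i[0]);               // stand-in for event generation
      }).wait();
      sycl::free(out, q);
    }

    int main() {
      std::vector<sycl::queue> queues;
      for (const auto &dev : sycl::device::get_devices(sycl::info::device_type::gpu))
        queues.emplace_back(dev);                        // one queue per GPU on the node

      for (auto &q : queues)
        run_batch(q, 100000);                            // batches are independent of one another
    }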

On the other hand, as the full Aurora system comes online, the ATLAS team must measure whether the current workflow developed for the code is effective on the exascale computer’s vast array of compute nodes.

Nichols also developed a custom math function library that allows for swapping primitive data types with SYCL vector types without breaking the MadGraph code.

“The SYCL vector types allow the code to take advantage of vector instruction sets available on CPUs, giving a performance boost on those devices,” he said. “Using SYCL vector types in this ad hoc way is in contrast to the recommended approach, which is to rely on auto-vectorization and use function widening. However, for a large legacy codebase like MadGraph, conventional approaches are often insufficient to gain the desired performance.” He added that the library still requires further revision because—while use with the SYCL vector type delivers strong performance on CPUs—on GPUs its performance slows considerably, and its compilation time increases substantially.
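
The idea behind such a library can be sketched with a templated function that accepts either a primitive type or a SYCL vector type, so call sites stay unchanged when the underlying type is swapped. The function name my_fma and the vector width here are illustrative assumptions, not taken from the actual library.

    #include <sycl/sycl.hpp>

    // Works for float/double as well as sycl::vec<float, N> / sycl::vec<double, N>,
    // because sycl::vec provides element-wise operator* and operator+.
    template <typename T>
    inline T my_fma(const T &a, const T &b, const T &c) {
      return a * b + c;
    }

    int main() {
      float s = my_fma(1.0f, 2.0f, 3.0f);                 // scalar path (plain CPU/GPU code)

      sycl::vec<float, 8> va(1.0f), vb(2.0f), vc(3.0f);   // vector path maps to CPU SIMD lanes
      sycl::vec<float, 8> v = my_fma(va, vb, vc);

      return static_cast<int>(s) + static_cast<int>(v[0]); // use results so they are not optimized away
    }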

Nichols intends to improve the library through systematic testing of the code, and by consulting with other developers to glean a diverse array of perspectives.

Such collaboration has played a large role in bringing MadGraph to exascale and in tuning Aurora for future users; regular workshops with Intel staff at the Center of Excellence, in particular, helped originate ideas for improving the performance of the MadGraph code and for identifying needed compiler improvements. ESP projects in general are an essential vehicle for resolving issues inherent to the rollout of complex, large-scale HPC systems.

The SYCL portability framework gives developers several methods for launching parallel kernels (the functions that execute on the GPU), which led to experimentation with, and comparison of, different kernel launch methods: basic data parallel kernels, work-group data parallel kernels, and hierarchical data parallel kernels. Hierarchical data parallel kernels were found to perform best.
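
The three launch styles can be illustrated with a trivial element-wise kernel standing in for the matrix element calculation; the sizes and kernel body below are assumptions for illustration only.

    #include <sycl/sycl.hpp>

    int main() {
      constexpr size_t N = 1024, WG = 128;
      sycl::queue q;
      float *out = sycl::malloc_shared<float>(N, q);

      // 1. Basic data parallel kernel: the runtime chooses the work-group size.
      q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        out[i] = 1.0f;
      }).wait();

      // 2. Work-group (ND-range) data parallel kernel: explicit work-group size.
      q.parallel_for(sycl::nd_range<1>({N}, {WG}), [=](sycl::nd_item<1> it) {
        out[it.get_global_id(0)] += 1.0f;
      }).wait();

      // 3. Hierarchical data parallel kernel: an outer loop over work-groups and
      //    an inner loop over work-items within each group.
      q.submit([&](sycl::handler &h) {
        h.parallel_for_work_group(sycl::range<1>(N / WG), sycl::range<1>(WG),
                                  [=](sycl::group<1> g) {
          g.parallel_for_work_item([&](sycl::h_item<1> it) {
            out[it.get_global_id(0)] += 1.0f;
          });
        });
      }).wait();

      sycl::free(out, q);
    }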

“The relevant piece of the SYCL specification has been under revision recently, so I’ve been interested in helping with that process, to which end I developed some different test applications,” Nichols mentioned. “Just testing the various options available under the SYCL specification has been the best way to determine how to improve performance.”

The Argonne ATLAS team explored offloading different parts of the full software pipeline to the GPU, but Nichols discovered that almost all of the performance bottlenecks could be attributed to the matrix element calculations. Offloading these calculations to the GPU accelerated them to such a degree that the elements of the code still running on the CPU experienced minor negative impacts. (The effort necessary to offload those impacted elements outweighed any performance boost that could be gained by doing so.)

Nichols is currently working toward the MadGraph release, which entails completing documentation, testing and ensuring that all physics processes are functioning correctly, and honing the SYCL port to ensure code maintenance is as simple and straightforward as possible for users. These efforts are intended to culminate in extending the ATLAS project to Aurora via eventual deployment of the SYCL port.