ALCF hackathon helps researchers boost performance on Polaris supercomputer

ALCF staff welcomed 114 researchers to Argonne for the INCITE GPU Hackathon. 

The ALCF hosted its fourth GPU Hackathon to help attendees improve their application performance in preparation for the INCITE call. 

The Argonne Leadership Computing Facility (ALCF) recently welcomed over 100 researchers and developers to the U.S. Department of Energy’s (DOE) Argonne National Laboratory for a hackathon designed to boost code performance on the facility’s supercomputing resources. The annual event paired teams with ALCF mentors to scale and optimize applications in preparation for the 2025 call for proposals for DOE’s Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.

Both current INCITE awardees and future applicants were able to get hands-on time on the ALCF’s Polaris system and learn firsthand how to fine-tune their codes to run efficiently on the high-powered supercomputer. Polaris is a Hewlett Packard Enterprise system with NVIDIA GPUs (graphics processing units) and AMD CPUs (central processing units). The ALCF is a DOE Office of Science user facility at Argonne.

Attendees were paired with ALCF staff and shared their progress in optimizing their codes throughout the hackathon. 

“More than anything we want attendees to be successful in running their applications on our machines,” says ALCF computational scientist Yasaman Ghadar, who organized this year’s event. “We pair our expert staff with hackathon attendees, so they feel comfortable asking questions while working on their code, and build what we hope to be strong and lasting relationships.”

A total of 114 people across 20 teams attended this year’s hackathon. The event featured an INCITE panel discussion and several presentations that included tips for running AI and machine learning frameworks on Polaris and Aurora, an overview of tools and software development kits, and a general discussion on the future of AI for science. Attendees also presented talks to share their progress in optimizing their codes during the hackathon. 

Hammad Farooq, a Ph.D. student at the University of Illinois Chicago, attended the hackathon to advance his research team’s computational readiness for the INCITE call. Their research aims to use the ALCF’s Aurora exascale supercomputer to construct detailed 3D genome-folding models to better understand the relation between genome structure and function.

“Previously, our code was implemented considering the NVIDIA GPU architecture for Polaris,” says Farooq. “However, since we are planning to apply for time on Aurora in our INCITE application, and Aurora is based on Intel GPU architecture, our main goal was to adapt our implementation to be compatible with Aurora and Polaris.”

During the hackathon, the team focused on porting their source code from OpenACC to OpenMP target offload to prepare for Aurora.
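To give a sense of what such a port involves, the following is a minimal sketch in C, not the team's actual code. The function name `saxpy_omp` and the loop itself are hypothetical and chosen only for illustration. The OpenACC directive appears as a comment, and the OpenMP target offload directive that replaces it is shown on the same loop.

```c
#include <stddef.h>

/* Hypothetical example loop; not taken from the team's application.
 *
 * OpenACC form of the loop, kept as a comment for comparison:
 *
 *   #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
 *   for (size_t i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
 */

/* OpenMP target offload version of the same loop. */
void saxpy_omp(size_t n, float a, const float *x, float *y)
{
    /* map(to:) plays the role of OpenACC copyin,
     * map(tofrom:) plays the role of OpenACC copy. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

When the code is built without OpenMP offload support, the pragma is ignored and the loop runs on the host, which makes it possible to check a port for correctness before testing it on a GPU.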

Teams met May 21-23, 2024, at Argonne National Laboratory to collaborate on their code.

Before the hackathon, Farooq had never worked with the OpenMP programming model. During the event, he learned how to use OpenMP and implemented it for a significant portion of his team’s code. 

“The highlight of the event was successfully achieving our goal of porting our code from OpenACC to OpenMP,” says Farooq. “This accomplishment ensured our code's compatibility with Aurora and significantly enhanced our understanding of adapting code for different GPU architectures.”

Niksa Praljak, a Ph.D. student at the University of Chicago, was part of a hackathon team that is pursuing research at the intersection of deep generative modeling and protein design for synthetic biology. Their work requires the training of large-scale AI models and datasets.

“I was so excited about the hackathon,” says Praljak. “I was ready to collaborate and scale our current code for distribution training with Polaris’s NVIDIA A100 GPUs across hundreds of compute nodes.” 

The event marked a significant milestone for the team, as they scaled their model to 3 billion parameters and distributed it over 512 GPUs, something they had not achieved before.

“This achievement, which led to nearly linear scaling, is a testament to our ability to push the boundaries of what is possible,” says Praljak.

Rachit Kumar, a Ph.D. student at the University of Pennsylvania, attended the hackathon to use ALCF resources to advance his team’s efforts to develop novel methods for analyzing genetics data.

“My team and I wanted to learn more about what areas of our code to dedicate our time to optimizing,” says Kumar. “Beyond that, we also wanted to properly scale our calculations to multi-node, multi-GPU systems to ensure that it scaled well with minimal bottlenecks.” 

During the hackathon, Kumar and his team spent their initial time profiling their code to find and address any performance bottlenecks.

“We managed to improve the performance of our code by over 60%, even on a single GPU, by identifying places where unnecessary host-to-device transfers were occurring,” says Kumar.
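The article does not say which programming model Kumar's team uses, so the following is only a sketch, written in C with OpenMP target offload, of the general technique of avoiding repeated host-to-device transfers. The function `scale_repeatedly` and its parameters are hypothetical. The idea is to keep an array resident on the GPU across several kernels rather than copying it in and out around each one.

```c
#include <stddef.h>

/* Hypothetical example: apply `steps` scaling passes to v while keeping
 * v resident on the device for the whole sequence of kernels. */
void scale_repeatedly(size_t n, int steps, float factor, float *v)
{
    /* One copy to the device on entry and one copy back on exit. */
    #pragma omp target data map(tofrom: v[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            /* v is already present on the device, so this map clause
             * reuses the existing device copy and triggers no transfer. */
            #pragma omp target teams distribute parallel for \
                map(tofrom: v[0:n])
            for (size_t i = 0; i < n; ++i) {
                v[i] *= factor;
            }
        }
    }
}
```

Without the enclosing `target data` region, each pass would copy the array to the device and back again, which is the kind of redundant traffic a profiler can reveal.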

After making the necessary optimizations, the team found that their code performed much better than expected. Kumar and his team leveraged the code enhancements to apply for an INCITE award and plan to continue working with the ALCF on this project.

“We learned that by using the computational power at the ALCF, we can go beyond what we were hoping to do, which was mostly computing these pairwise correlations for others to use in downstream analyses,” says Kumar. “Now, we are hoping to extend our code further to glean even more complex insights from these pairwise correlations as well as to compute other metrics that we previously thought might be intractable.” 

For information about upcoming training opportunities, visit our events page: https://www.alcf.anl.gov/events 
