New computing framework streamlines the use of AI and supercomputers for protein structure prediction

As one of the most important therapeutic targets for drug discovery, the spike protein is the subject of drug development across many modalities. This figure of a SARS-CoV-2 spike protein shows the trimeric receptor-binding domain in a prefusion closed conformation. (Image by Hyun Park and Eliu Huerta/Argonne National Laboratory and University of Illinois Urbana-Champaign.)

With help from the ALCF's Polaris supercomputer, researchers from Argonne and the University of Illinois Urbana-Champaign developed the APACE framework to optimize AlphaFold2 to run at scale on HPC systems, providing a tool that significantly reduces time to solution. 

A team of researchers from the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, the U.S. Department of Energy’s (DOE) Argonne National Laboratory, and the University of Chicago has developed a novel computational framework that simplifies and speeds up the process of using artificial intelligence (AI) tools and algorithms to understand 3D protein structure.

The framework, detailed in a new paper published in the Proceedings of the National Academy of Sciences, also predicts conformational diversity of proteins, an important property since proteins are malleable structures that can flip between different conformations to do their job. The paper was authored by Roland Haas, a senior research programmer in NCSA’s Scientific and Engineering Applications Support group; Eliu Huerta, lead for translational AI at Argonne and senior scientist of the Consortium for Advanced Science and Engineering at the University of Chicago; Hyun Park, then an Illinois Ph.D. student in biophysics; and Parth Patel, an NCSA graduate research assistant.

The team developed APACE, a computational tool that efficiently runs AlphaFold2, an AI program used to predict protein structure, on high-performance computing (HPC) systems. They deployed APACE on the Delta supercomputer at NCSA to measure how well it performed in predicting the structures of four exemplar proteins. Using up to 300 ensembles distributed across 300 NVIDIA A100 GPUs, they found that APACE is up to two orders of magnitude faster than off-the-shelf AlphaFold2 implementations. Moreover, the same approach could be used in a variety of scientific disciplines and could be linked with robotics laboratories to automate and accelerate scientific discovery. The team later reproduced the work on the Polaris supercomputer at the Argonne Leadership Computing Facility (ALCF). The ALCF is a DOE Office of Science user facility.

“Foundation AI models have the potential to transform the practice of science if they are findable, accessible and ready to use by the broader scientific community,” said Huerta. “This project demonstrates how to create and share the required scientific data infrastructure to truly democratize cutting-edge AI and leverage modern computing environments to maximize its science reach.”

Biomedical researchers study proteins to understand a wide range of biological functions. Proteins are chains of amino acids and their ordering into 3D structures determines biological functions. Understanding how proteins are formed — a process often called protein folding, in which amino acids come together in structured chains capable of carrying out specific functions — is crucial to understanding normal biological functions as well as how folding mistakes can lead to serious diseases.

Predicting protein folding is extremely computationally intensive since a typical protein can have hundreds of amino acids and thousands of atoms that can combine in different ways. The usual methods for studying protein structure are X-ray crystallography, a tool for determining the atomic and molecular structure of a crystal, and cryo-EM, which involves flash-freezing molecules in liquid nitrogen and bombarding them with electrons to capture their images with a special camera.

AlphaFold and AlphaFold2 showed that AI software can accurately and quickly predict protein structure from amino acid sequences, and the development of APACE builds on this breakthrough.

APACE optimizes AlphaFold2 to run at scale on HPC platforms and efficiently handles its multi-terabyte protein database. The work shows that large AI models can be combined with the power of HPC to allow scientists to study multi-protein complexes and obtain results quickly, accurately and at higher resolution, all factors that could lead to a fuller understanding of protein structure and kickstart the development of new drugs that can treat many diseases.

“Research in new drugs is extremely time consuming and bottlenecked by the need to synthesize different candidate compounds to test their medical effectiveness in a laboratory,” said Haas. “APACE allows drug researchers to drastically reduce the time required to screen out potential candidate compounds and thus focus on the most promising substances. This way more compounds can be tested and the time to develop a new drug, for example one tailored towards a specific viral strain, can be reduced.”

A key feature of APACE is better data management by hosting AlphaFold2’s multi-terabyte model and database on the supercomputer, from which the framework’s neural networks can readily access data. Other improvements include CPU optimization and GPU optimization to parallelize GPU-intensive neural network protein structure prediction steps.
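
As a rough illustration of the parallelization idea described above, the following Python sketch assigns independent prediction tasks (for example, ensemble members) to GPUs in round-robin order and estimates how many sequential rounds are needed if every task takes the same time. It is a minimal sketch under stated assumptions, not APACE's actual implementation: the function names assign_round_robin and ideal_rounds are hypothetical, and the equal-runtime assumption is a simplification.

```python
from math import ceil


def assign_round_robin(num_tasks: int, num_gpus: int) -> dict[int, list[int]]:
    """Map each GPU index to the task indices it would process (hypothetical helper)."""
    if num_tasks < 0 or num_gpus <= 0:
        raise ValueError("num_tasks must be >= 0 and num_gpus must be > 0")
    assignment: dict[int, list[int]] = {gpu: [] for gpu in range(num_gpus)}
    for task in range(num_tasks):
        # Task i goes to GPU (i mod num_gpus), so the load differs by at most one task.
        assignment[task % num_gpus].append(task)
    return assignment


def ideal_rounds(num_tasks: int, num_gpus: int) -> int:
    """Sequential rounds needed, assuming every task takes the same time."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be > 0")
    return ceil(num_tasks / num_gpus)


if __name__ == "__main__":
    # Figures from the article: up to 300 ensembles across 300 GPUs.
    plan = assign_round_robin(num_tasks=300, num_gpus=300)
    print("max tasks per GPU:", max(len(tasks) for tasks in plan.values()))
    print("rounds on 300 GPUs:", ideal_rounds(300, 300))
    # Assumed comparison point: a small 4-GPU workstation.
    print("rounds on 4 GPUs:", ideal_rounds(300, 4))
```

With one task per GPU, all 300 predictions finish in a single round under this idealized model, while the assumed 4-GPU machine would need 75 rounds. Real runs also depend on database access, CPU preprocessing and differing task lengths, which this sketch ignores.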

“The first problem with using an AI model is the storage of the data,” said Park, who, like Patel, was at Argonne for an internship when the work on APACE was done. “We need to pass 2.6 terabytes (the size of the AlphaFold2 database) as well as the computation from sequence to structure prediction. Some university labs may be able to do that, but what matters is that you scale it up so that scientists around the world can use it.”

Added Patel, “That’s why HPC utilization is important, especially for AI models. Anyone who can get into an HPC system can have access to both data and also the computational capability to do the actual AI model calculation. Not to mention there’s a huge speed increase.”

Huerta said the team chose to work with AlphaFold2 because it is used extensively in different research communities, including biophysics, chemistry, and drug design and discovery.

“APACE provides all the capabilities of the original AlphaFold2 model and empowers researchers with the ability to leverage supercomputers to reduce time-to-solution, and to connect this tool with self-driving laboratories to automate and accelerate discovery,” he said.

Huerta said the team will continue to build a community of APACE users to maximize the usability of AI models with HPC platforms. Haas said the team is now focused on attacking the remaining bottlenecks in the system to further improve speed. He’d also like to make APACE available on more compute clusters so that more scientists can take advantage of it.

“We’d also like to explore using the methods we’ve developed to speed up AlphaFold2 with other foundational machine learning models that are too complex to easily use on common desktop workstations,” said Haas. “It’s all about making the best tools available and as easy to use as possible.”

This article is based on a release from NCSA.
