A finalist for the 2024 Gordon Bell Prize, the team's multimodal protein design framework used five of the world’s top supercomputers, including the ALCF’s Aurora exascale system.
Harnessing the power of artificial intelligence (AI) and the world’s fastest supercomputers, a research team led by the U.S. Department of Energy’s (DOE) Argonne National Laboratory has developed an innovative computing framework to speed up the design of new proteins.
On the heels of this year’s Nobel Prize in Chemistry, which recognized advances in computational protein design, Argonne’s AI-driven approach has been selected as a finalist for the prestigious Gordon Bell Prize. Presented by the Association of Computing Machinery, the annual prize awards breakthroughs in using high performance computing to solve complex science problems.
One of the key innovations of the team’s MProt-DPO framework is its ability to integrate different types of data streams, or “multimodal data.” It combines traditional protein sequence data with experimental results, molecular simulations and even text-based narratives that provide detailed insights into each protein’s properties. This multimodal approach has the potential to accelerate protein discovery for a wide range of applications.
“Say you want to build a new vaccine or design an enzyme that can break down plastics for recycling in an environmentally friendly way,” said Arvind Ramanathan, Argonne computational biologist. “Our AI framework can help researchers zero in on promising proteins from countless possibilities, including candidates that may not exist in nature.”
Mapping a protein’s amino acid sequence to its structure and function is a long-standing research challenge. Each unique arrangement of amino acids—the building blocks of proteins—can yield different properties and behaviors. The sheer volume of potential variations makes it impractical to test them all through experiments alone.
To put this in perspective, modifying just three amino acids in a sequence of 20 creates 8,000 possible combinations. But most proteins are far more complex, with some research targets containing hundreds to thousands of amino acids.
“For example, if we change the position of 77 amino acids within a 300-amino-acid protein, we’re looking at a design space of a Googol, or 10100, unique possibilities,” said Gautham Dharuman, Argonne computational scientist and lead author on a paper introducing the framework. “This is why we need large language models and supercomputers to help explore this vast space in a reasonable amount of time.”
Large language models (LLMs), which form the basis of chatbots like ChatGPT, are AI models that are trained on large amounts of data to detect patterns and generate new information. In the realm of science, LLMs help researchers sift through massive datasets, providing insights and predictions for complex problems like protein design.
Building and training the framework’s LLMs required using powerful supercomputers, including the Aurora exascale system at the Argonne Leadership Computing Facility (ALCF). The ALCF is a DOE Office of Science user facility.
“The language models we trained are on the order of a few billion parameters,” said Venkat Vishwanath, AI and machine learning team lead at the ALCF. “Supercomputers are crucial not only for training and fine-tuning the models, but also for running the end-to-end workflow. This includes performing large-scale simulations to verify the stability and catalytic activity of the generated protein sequences.”
In addition to Aurora, the team deployed their framework on other top systems: Frontier at DOE’s Oak Ridge National Laboratory, Alps at the Swiss National Supercomputing Centre, Leonardo at CINECA center in Italy, and the PDX system at NVIDIA. They achieved over one exaflop of sustained performance (mixed precision) on each machine, with a peak performance of 5.57 exaflops on Aurora. The Argonne system recently earned the top spot in a measure of AI performance, achieving 10.6 exaflops on the HPL-MxP benchmark.
Surpassing an exaflop, which equals a quintillion calculations per second, highlights the immense computational power required for this effort.
“By adapting our workflow to run on multiple top supercomputers spanning diverse architectures, we've demonstrated the framework’s portability and scalability,” Vishwanath said. “This was important because it shows that our tool can be used by researchers regardless of the machine or location.”
The DPO in MProt-DPO stands for Direct Preference Optimization. The DPO algorithm helps AI models improve by learning from preferred or unpreferred outcomes. By adapting DPO for protein design, the Argonne team enabled their framework to learn from experimental feedback and simulations as they happen.
“If you think about how ChatGPT works, humans provide feedback on whether a response is helpful or not. That input is looped back into the training algorithm to help the model learn your preferences," Ramanathan said. “MProt-DPO works in a similar way, but we replace human feedback with the experimental and simulation data to help the AI model learn which protein designs are most successful.”
While generative AI techniques like LLMs have been developed for biological systems, existing tools have been limited by their ability to incorporate multimodal data. MProt-DPO, however, includes experimental data and text-based narratives that give added context to each protein’s behavior. This approach builds on earlier work by Ramanathan and colleagues, who created a text-guided protein design framework.
“Our motivation was to create a framework that can use LLMs and an end-to-end workflow to generate protein sequences with specific properties of interest such as fitness or catalytic activity,” Dharuman said. “DPO then uses these measures as feedback to align the LLMs, enabling them to generate more preferred outcomes in the subsequent iterations. We employed supercomputers to show that we can greatly reduce the time-to-solution by incorporating this feedback in the design process.”
Ramanathan noted that using experimental data also helps improve the trustworthiness of their AI models.
“Bringing validated results into the design loop helps prevent the models from hallucinating wild or unrealistic sequences,” he said. “This results in more reliable protein designs.”
The team tested MProt-DPO on two tasks to demonstrate its ability to handle complex protein design challenges. First, they focused the yeast protein HIS7, using experimental data to improve the performance of various mutations. For the second task, they worked on malate dehydrogenase, an enzyme that plays a key role in how cells produce energy. Using simulation data, they optimized the design of the enzyme to improve its catalytic efficiency.
The team is collaborating with Argonne biologists to validate the AI-generated designs in a laboratory, where initial tests have shown they are performing as expected.
The creation of MProt-DPO is also helping to advance Argonne’s broader AI for science and autonomous discovery initiatives. The tool’s use of multimodal data is central to the ongoing efforts to develop AuroraGPT, a foundation model designed to aid in autonomous scientific exploration across disciplines.
“Demonstrating that this approach delivers strong scientific results at extreme scales is an important step toward building more robust AI models,” Ramanathan said. “It also moves us closer to autonomous discovery, where AI can help streamline not only experiments but the entire scientific process.”
The team’s research was supported by the DOE Office of Science’s Advanced Scientific Computing Research program and the National Institutes of Health.
Additional team members include Argonne’s Kyle Hippe, Alexander Brace, Sam Foreman, Väinö Hatanpää, Varuni K. Sastry, Huiho Zheng, Logan Ward, Servesh Muralidharan, Archit Vasan, Bharat Kale, Carla M. Mann, Heng Ma, Murali Emani, Michael E. Papka, Ian Foster and Rick Stevens; Yun-Hsuan Cheng, Yuliana Zamora and Tom Gibbs from NVIDIA; Shengchao Liu from the University of California, Berkeley; Chaowei Xiao from the University of Wisconsin-Madison; Mahidhar Tatineni from the San Diego Supercomputing Center; Deepak Canchi, Jerome Mitchell, Koichi Yamad and Maria Garzaran from Intel; and Anima Anandkumar from the California Institute of Technology.