Argonne augments Theta supercomputer with GPUs to accelerate coronavirus research

ALCF staff members augment Theta with the installation of NVIDIA DGX A100 nodes. (Image: Argonne National Laboratory)

With funding from the CARES Act, Theta has been upgraded to provide additional computing power in support of COVID-19 research.

Earlier this year, the U.S. Department of Energy’s (DOE) Argonne National Laboratory unveiled an upgrade to its high-performance computing (HPC) resource Theta. The new hardware stands to substantially improve the performance and capability of the Theta supercomputer, an Intel-HPE/Cray XC40 system housed at the Argonne Leadership Computing Facility (ALCF). The ALCF is a U.S. DOE Office of Science User Facility.

Deployed rapidly in response to the global pandemic, the Theta upgrade and its supporting systems are currently being used in the fight against the coronavirus and the disease it causes, COVID-19. Funding for the hardware was provided by the Coronavirus Aid, Relief and Economic Security (CARES) Act, signed into law in March 2020.

The augmented Theta architecture adds 24 NVIDIA DGX A100 nodes to the existing system, combining graphics processing unit (GPU) and central processing unit (CPU) capabilities. Each DGX A100 node comprises eight NVIDIA A100 Tensor Core GPUs and two AMD Rome CPUs and provides 320 gigabytes of GPU memory (7,680 GB in aggregate across the 24 nodes) for training artificial intelligence (AI) models, while also enabling GPU-specific and GPU-enhanced HPC applications for modeling and simulation.
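The memory figures quoted above can be checked with a few lines of arithmetic. The following is a minimal Python sketch and not part of the original announcement: the constant and function names are illustrative, and the per-GPU figure is derived from the stated 320 GB per node rather than quoted from the article.

```python
# Figures taken from the article text.
NODES = 24                    # DGX A100 nodes added to Theta
GPUS_PER_NODE = 8             # A100 GPUs in each DGX A100 node
GPU_MEMORY_PER_NODE_GB = 320  # GPU memory per node, as stated


def total_gpus(nodes: int, gpus_per_node: int) -> int:
    """Number of GPUs across all added nodes."""
    return nodes * gpus_per_node


def per_gpu_memory_gb(node_memory_gb: int, gpus_per_node: int) -> float:
    """GPU memory per device, derived from the per-node figure."""
    return node_memory_gb / gpus_per_node


def aggregate_memory_gb(nodes: int, node_memory_gb: int) -> int:
    """GPU memory summed over all added nodes."""
    return nodes * node_memory_gb


if __name__ == "__main__":
    print("GPUs added:", total_gpus(NODES, GPUS_PER_NODE))
    print("Memory per GPU (GB):",
          per_gpu_memory_gb(GPU_MEMORY_PER_NODE_GB, GPUS_PER_NODE))
    print("Aggregate GPU memory (GB):",
          aggregate_memory_gb(NODES, GPU_MEMORY_PER_NODE_GB))
```

The aggregate value matches the 7680 GB given in the text, which confirms that figure refers to all 24 nodes combined.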

For system documentation, visit the Theta User Guide.

To be considered for an allocation award, please submit your request using the ALCF's Discretionary Allocation Request form.

A 15-terabyte solid-state drive offers up to 25 gigabits per second in bandwidth for each node. The compute interconnect comprises 24 Mellanox QM9700 HDR200 40-port switches wired in a non-blocking fat-tree topology.

“This upgrade dramatically amplifies the computational power of Theta,” said Mark Fahey, ALCF Director of Operations. “The capabilities it presents will accelerate the complex and diverse workloads that define contemporary scientific computing, integrating data analytics with AI training and learning into a single platform.”

The DGX A100 nodes are integrated into Theta through the ALCF’s Cobalt HPC scheduler. Scheduling is initially available at the node level, with scheduling of individual GPUs to follow in the near future.

“This additional hardware, while set to enable important research in its own right, represents a stepping stone to using the advanced GPU-accelerated systems due for arrival in the near future—that is, Polaris and Aurora,” said Katherine Riley, ALCF Director of Science.

Like the Theta upgrade, Polaris, the ALCF’s next machine, features a heterogeneous architecture that uses both CPUs and GPUs. Because many of its capabilities are GPU-derived, Polaris will serve as a scalable bridge that prepares ALCF users for the Aurora exascale machine.

The augmented Theta hardware was fully dedicated to coronavirus research when deployed in May. While coronavirus research will remain the system’s top priority, Argonne is expanding availability to the broader user community.

“The difference in computational performance accelerated our work almost instantly,” said Arvind Ramanathan, a computational biologist at Argonne who leads a group of researchers aiming to unravel the fundamental biological mechanisms of the virus while also identifying potential therapeutics to treat the disease. “Our data-intensive workloads—which combine AI, machine learning techniques, and molecular dynamics simulations—require significant brute force for the pace of discovery to proceed at a reasonable rate.”

With the impending arrival of Polaris and Aurora—and the exascale era with the latter—that pace will continue to accelerate.

===========

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America’s scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.

The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.
