Theta/ThetaGPU Machine Overview

Help Desk

Hours: 9:00am-5:00pm CT M-F
Email: support@alcf.anl.gov

Theta

Theta is a Cray XC40 system consisting of several types of nodes. Table 1 summarizes the system's capabilities.

Information on ThetaGPU is provided further below.

Table 1: Theta Machine Overview

| Component | Description | Aggregate |
|---|---|---|
| Compute Nodes | Intel KNL 7230 | 4,392 |
| Compute Cores | 64 per node | 281,088 |
| Compute Memory - DDR4 | 192 GiB per node | 843,264 GiB |
| Compute Memory - MCDRAM | 16 GiB per node | 70,272 GiB |
| Compute SSD | 128 GiB per node | 562,176 GiB |
| LNET | Service node for Lustre | 30 |
| DVS | Service node for Cray DVS | 60 |
| Compute Racks | | 24 |
| LINPACK RMax (Rpeak) | Top500 LINPACK results | 6.92 PFLOP/s (11.69 PFLOP/s) |
| Tier 2 | Service node | 13 |
| MOM | Service node | 3 |
| eLogin | Login node | 6 |
| Project file system | Lustre | 10 PB |
| Home file system | GPFS | 1 PB |
| High Speed Network | Aries Dragonfly | 14,400 rank-1 ports; 14,400 rank-2 ports; 9,600 rank-3 ports |

Login Nodes

Theta has six login nodes, which are Cray eLogin machines that sit outside the main Cray system. These nodes have Intel Haswell E5-2698 v3 processors and 256 GB of DDR4 memory. These front-end nodes are used for editing code, building code, and submitting jobs, and they are shared by all users of the Theta system. Please note that the login nodes do not support the AVX-512 instructions used by the KNL compute nodes, so applications compiled for the compute nodes will not run on the login nodes.
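
As an illustration, a build on a login node with the Cray compiler wrapper targets the KNL compute nodes (a minimal sketch; the target module is normally loaded by default on Theta, and the source and binary names are placeholders):

```bash
# Ensure the KNL target module is loaded so the Cray wrapper cross-compiles
# for the compute nodes (it is typically loaded by default on Theta).
module load craype-mic-knl

# "cc" is the Cray compiler wrapper around the currently loaded programming
# environment; the resulting binary uses AVX-512 and will only run on the
# KNL compute nodes, not on the Haswell login nodes.
cc -O2 -o my_app my_app.c
```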

Service Nodes

Service node is Cray's generic name for the various nodes that provide internal infrastructure for the overall system. These nodes appear as service nodes in the xtnodestat output and consume part of the Node Identifier (NID) space. This matters for job scheduling: requesting a NID that belongs to a service node will prevent your job from running.

MOM

The Machine Oriented Mini-server (MOM) nodes run various parts of the Cray software infrastructure. These are Intel E5-2695 v4 nodes with 128 GiB of DDR4 memory. They also run the Cobalt scheduler and execute user batch scripts. It is critical that users do not run computationally or memory-intensive tasks directly in the batch script; instead, launch that work onto the compute nodes (for example, with aprun, as sketched below).
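
The sketch below shows the shape of such a batch script; the project name, node count, walltime, and executable are placeholder assumptions. The script itself runs on a MOM node, so it only sets up the environment and uses aprun to launch the real work onto the compute nodes:

```bash
#!/bin/bash
# Submitted with something like: qsub -A MyProject -n 2 -t 30 ./job.sh
# (placeholder project, node count, and walltime).

echo "Starting Cobalt job $COBALT_JOBID"

# All computation happens on the allocated KNL compute nodes:
# 2 nodes x 64 MPI ranks per node = 128 ranks total.
aprun -n 128 -N 64 ./my_app

echo "Finished Cobalt job $COBALT_JOBID"
```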

LNET

The Lustre LNET router nodes serve as a gateway between the high-speed Aries fabric and the InfiniBand FDR storage network. These are Intel Sandy Bridge E5-2670 nodes with 64 GiB of DDR3 memory.

DVS

The Cray Data Virtualization Service (DVS) servers provide a gateway between the high-speed Aries fabric and other external file systems. They primarily provide access to GPFS file systems, such as the home file system. These nodes are physically identical to the LNET nodes: Intel Sandy Bridge E5-2670 with 64 GiB of DDR3 memory.

Tier 2

The Tier 2 nodes provide infrastructure to the Cray software stack and aggregate sets of compute nodes. These nodes are physically identical to the MOM nodes and have Intel E5-2695 v4 CPUs with 128 GiB of DDR4 memory.

Compute Nodes

Theta provides a single compute node type: the Intel Knights Landing (KNL) 7230 processor with 16 GiB of MCDRAM and 192 GiB of DDR4 memory. Each node has 64 cores, and each core provides 4 SMT hardware threads, for a total of 256 hardware threads per node (see the example below).
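
For example, an aprun launch line along these lines (a sketch with placeholder rank counts and executable) exercises the SMT hardware threads with a hybrid MPI+OpenMP layout on two nodes:

```bash
# Use 4 OpenMP threads per MPI rank so each core's 4 hardware threads are busy.
export OMP_NUM_THREADS=4

# -n 128     total MPI ranks (2 nodes x 64 ranks)
# -N 64      ranks per node (one per physical core)
# -d 4       reserve 4 CPUs (hardware threads) per rank for its OpenMP threads
# -j 4       make all 4 hardware threads of each core available
# -cc depth  bind each rank's threads to its reserved CPUs
aprun -n 128 -N 64 -d 4 -j 4 -cc depth ./my_app
```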

ThetaGPU

ThetaGPU is an extension of Theta comprising 24 NVIDIA DGX A100 nodes. Each DGX A100 node contains eight NVIDIA A100 Tensor Core GPUs and two AMD Rome CPUs, providing 320 GB of GPU memory per node (7,680 GB across the system) for training artificial intelligence (AI) datasets, while also enabling GPU-specific and GPU-enhanced high-performance computing (HPC) applications for modeling and simulation.
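
On a DGX A100 compute node, the memory behind that per-node figure can be checked with nvidia-smi (a minimal sketch; this must be run on a ThetaGPU compute node, not a login node):

```bash
# List the node's eight A100 GPUs and their memory; 8 x 40 GB = 320 GB per node.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```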

The DGX A100's integration into Theta is achieved via the ALCF's Cobalt HPC scheduler and shared access to a 10-petabyte Lustre file system. Users' ALCF accounts work across both systems, ensuring a smooth onboarding process for the expanded system.

Each node also has 15 terabytes of solid-state storage offering up to 25 gigabytes per second of bandwidth. The dedicated compute fabric comprises 20 Mellanox QM9700 HDR200 40-port switches wired in a fat-tree topology. ThetaGPU does not use Theta's Aries interconnect.

Table 2 summarizes the capabilities of a ThetaGPU compute node.

Table 2: ThetaGPU Compute Node Overview

| Component | Per Node | Aggregate |
|---|---|---|
| AMD Rome 64-core CPU | 2 | 48 |
| DDR4 Memory | 1 TB | 24 TB |
| NVIDIA A100 GPU | 8 | 192 |
| GPU Memory | 320 GB | 7,680 GB |
| HDR200 Compute Ports | 8 | 192 |
| HDR200 Storage Ports | 2 | 48 |
| 100 GbE Ports | 2 | 48 |
| 3.84 TB Gen4 NVMe Drives | 4 | 96 |

Login Nodes

The Theta login nodes (see above) are the intended means of accessing ThetaGPU; a typical access path is sketched below.
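
In practice, access looks roughly like the following (the project name, queue name, node count, and walltime are assumptions; consult the ThetaGPU job submission documentation for the exact queues):

```bash
# Log in to a Theta login node, which serves as the front end for ThetaGPU.
ssh <username>@theta.alcf.anl.gov

# Submit a ThetaGPU job with Cobalt; "MyProject" and "full-node" are placeholders.
qsub -A MyProject -q full-node -n 1 -t 60 ./gpu_job.sh
```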

References

Cray XC40
Cray DVS
Cray Aries
Intel KNL
Lustre
IBM GPFS
Top 500