Data Storage
The ALCF’s data storage system is used to retain the data generated by simulations and visualizations. Disk storage provides intermediate-term storage for active projects, offering a means to access, analyze, and share simulation results. Tape storage is used to archive data from completed projects.
Disk Storage
The ALCF has both Lustre and GPFS file systems for data storage. Two Lustre file systems, Grand and Eagle, are described below:
Grand:
A Lustre file system residing on an HPE ClusterStor E1000 platform with 100 PB of usable capacity across 8,480 disk drives. The platform provides 160 Object Storage Targets and 40 Metadata Targets with an aggregate data transfer rate of 650 GB/s.
Grand is primarily used for compute campaign storage. See also File Systems, Data Sharing, Data Policy, and Data Transfer.
Eagle:
A Lustre file system residing on an HPE ClusterStor E1000 platform with 100 PB of usable capacity across 8,480 disk drives. The platform likewise provides 160 Object Storage Targets and 40 Metadata Targets with an aggregate data transfer rate of 650 GB/s.
Eagle is primarily used for sharing data with the research community. See also File Systems, Data Sharing, Data Policy, and Data Transfer.
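As a quick orientation to the scale of these figures, the sketch below divides the quoted capacity and bandwidth of one ClusterStor E1000 platform evenly across its drives and OSTs. It is back-of-envelope arithmetic only and says nothing about how Lustre actually stripes data.

```python
# Back-of-envelope arithmetic for one ClusterStor E1000 platform (Grand or Eagle),
# using only the figures quoted above; results are approximations, not measured values.

USABLE_CAPACITY_PB = 100      # usable capacity, petabytes
DISK_DRIVES = 8_480           # disk drives per platform
OSTS = 160                    # Object Storage Targets
AGGREGATE_BW_GBPS = 650       # aggregate data transfer rate, GB/s

capacity_per_drive_tb = USABLE_CAPACITY_PB * 1000 / DISK_DRIVES   # ~11.8 TB usable per drive
bandwidth_per_ost_gbps = AGGREGATE_BW_GBPS / OSTS                 # ~4.1 GB/s per OST

print(f"Usable capacity per drive: ~{capacity_per_drive_tb:.1f} TB")
print(f"Aggregate bandwidth per OST: ~{bandwidth_per_ost_gbps:.1f} GB/s")
```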
Tape Storage
ALCF computing resources share three 10,000-slot libraries using LTO-6 and LTO-8 tape technology. The LTO tape drives have built-in hardware compression with ratios typically between 1.25:1 and 2:1, depending on the data, giving an effective capacity of approximately 65 PB. See also Data Transfer and HPSS.
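The effective-capacity figure above depends on both the cartridge generations loaded and the compression actually achieved. The sketch below shows the underlying arithmetic; the native capacities per cartridge are standard LTO values, while the cartridge mix is a hypothetical placeholder rather than the facility's real inventory.

```python
# Illustrative arithmetic for effective tape capacity. The native per-cartridge
# capacities are standard LTO figures; the cartridge mix below is a hypothetical
# placeholder, NOT the facility's actual inventory. It only shows how the quoted
# 1.25:1 to 2:1 compression range drives the effective-capacity estimate.

LTO_NATIVE_TB = {"LTO-6": 2.5, "LTO-8": 12.0}   # native (uncompressed) TB per cartridge

def effective_capacity_pb(cartridges_by_gen, compression_ratio):
    """Effective capacity in PB for a given count of cartridges per LTO generation."""
    native_tb = sum(count * LTO_NATIVE_TB[gen] for gen, count in cartridges_by_gen.items())
    return native_tb * compression_ratio / 1000   # TB -> PB

example_mix = {"LTO-6": 10_000, "LTO-8": 2_000}  # hypothetical split across the 30,000 slots
for ratio in (1.25, 2.0):
    print(f"{ratio}:1 compression -> ~{effective_capacity_pb(example_mix, ratio):.0f} PB effective")
```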
Networking
Networking is the fabric that ties all of the ALCF’s computing systems together.
InfiniBand enables communication between system I/O nodes and the various storage systems described above. The Production HPC SAN is built upon NVIDIA Mellanox High Data Rate (HDR) InfiniBand hardware. Two 800-port core switches provide the backbone links between eighty edge switches, yielding 1,600 total available host ports, each at 200 Gbps, in a non-blocking fat-tree topology. The full bisection bandwidth of this fabric is 320 Tbps. The HPC SAN is maintained by the NVIDIA Mellanox Unified Fabric Manager (UFM), which provides Adaptive Routing to avoid congestion, along with the NVIDIA Mellanox Self-Healing Interconnect Enhancement for InteLligent Datacenters (SHIELD) resiliency system for link fault detection and recovery.
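The headline figures in the paragraph above follow from straightforward multiplication, as the sketch below shows. The 20-host-ports-per-edge split is inferred from the quoted totals rather than taken from switch documentation, and the quoted 320 Tbps corresponds to all 1,600 host ports running at line rate.

```python
# Reproduces the fat-tree figures quoted above: 80 edge switches, two 800-port core
# switches, 1,600 host ports at 200 Gbps (HDR). The 20-host-ports-per-edge split is
# inferred from the quoted totals, not taken from switch documentation.

EDGE_SWITCHES = 80
HOST_PORTS_PER_EDGE = 20      # inferred: 1600 total host ports / 80 edge switches
LINK_RATE_GBPS = 200          # HDR InfiniBand per-port rate

host_ports = EDGE_SWITCHES * HOST_PORTS_PER_EDGE      # 1600 host ports
fabric_tbps = host_ports * LINK_RATE_GBPS / 1000      # all host ports at line rate

print(f"Host ports: {host_ports}")                             # 1600
print(f"Aggregate fabric bandwidth: {fabric_tbps:.0f} Tbps")   # 320, matching the quoted figure
```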
When external communications are required, Ethernet is the interconnect of choice. Remote user access, system maintenance and management, and high-performance data transfers are all carried over the Local Area Network (LAN) and Wide Area Network (WAN) Ethernet infrastructure. This connectivity is built upon a combination of Extreme Networks SLX and MLXe routers and NVIDIA Mellanox Ethernet switches.
ALCF systems connect to other research institutions over multiple 100 Gbps Ethernet circuits that link to many high-performance research networks, including local and regional networks such as the Metropolitan Research and Education Network (MREN), as well as national and international networks such as the Energy Sciences Network (ESnet) and Internet2.
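For a rough sense of scale, the sketch below estimates an idealized transfer time over a single 100 Gbps circuit. The dataset size is an arbitrary example, and real-world throughput is lower once protocol overhead, encryption, and endpoint limits are factored in.

```python
# Rough transfer-time estimate over one 100 Gbps circuit, ignoring protocol overhead
# and storage-side limits, so real transfers will take longer. The 50 TB dataset size
# is an arbitrary example, not a quota or a typical project size.

LINK_RATE_GBPS = 100          # one WAN circuit, gigabits per second
DATASET_TB = 50               # hypothetical dataset size, terabytes

dataset_bits = DATASET_TB * 1e12 * 8                 # terabytes -> bits
seconds = dataset_bits / (LINK_RATE_GBPS * 1e9)
print(f"{DATASET_TB} TB at {LINK_RATE_GBPS} Gbps: ~{seconds / 3600:.1f} hours")   # ~1.1 hours
```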