Enabling Resilient and Portable Workflows from DOE’s Experimental Facilities

PI Katie Antypas, Lawrence Berkeley National Laboratory
Co-PI Debbie Bard, Lawrence Berkeley National Laboratory
Tom Uram, Argonne National Laboratory
Venkat Vishwanath, Argonne National Laboratory
Mallikarjun Shankar, Oak Ridge National Laboratory
Suhas Somnath, Oak Ridge National Laboratory
Project Summary

This ALCC project will enable research into experimental and observational data workloads, which differ from the traditional simulation workloads that run at large-scale computational facilities. The team's work will help define the architectural and technical roadmap for experimental and observational facility workflows running at DOE’s HPC facilities, expose the cross-facility policy challenges, and offer strategies to address them.

Project Description

The size and complexity of data from DOE’s experimental and observational facilities (EOFs) are already overwhelming scientists' ability to manage, analyze, search, and model it, and data set sizes are expected to grow dramatically in the next decade. As a result, scientists from these facilities are increasingly turning to high-performance computing (HPC) facilities for their workloads, including large-scale data analysis, AI, and simulation and modeling. These workflows often have different requirements from traditional simulation workloads because they need near real-time feedback, experiment-time availability, and resilience. HPC facilities, by contrast, are often optimized for high utilization and single large-scale jobs. Further, the first-of-a-kind technologies deployed at DOE’s HPC facilities mean that a period of hardening often occurs before systems reach full stability, and even after production deployment, downtimes are needed for technology and software upgrades that can again affect stability. Resolving this impedance mismatch requires creative and innovative solutions to provide resilience for complex workflows originating from EOFs.

A number of studies and considerable funded research have addressed application portability, focusing primarily on portable programming models and node architectures. More research and experimentation are needed on how to enable portable, complex cross-facility workflows. The investigation areas include determining primitives and portable abstraction levels for scheduling, data movement, and data management. Portable, high-performing, complex workflows would enable a new generation of EOFs with throughput- and deadline-driven computing requirements to access and benefit from HPC capabilities.

This ALCC project will enable research into experimental and observational data (EOD) workloads, which differ from the traditional simulation workloads that run at large-scale computational facilities. With some adaptation (often recasting as large high-throughput ensembles), EOD applications have been run at large scale across the DOE complex. Notable examples include simulation from high energy physics and astrophysics, and analysis of data from DOE light sources. One goal of this effort will be to generalize these successes to other applications and domains: simplifying the task of outfitting future workloads to run on large-scale parallel resources, ensuring that data is available where it is needed, and optimizing job execution for greater efficiency.

To this end, the ALCC team will partner with EOFs to define abstractions and APIs for their workloads, map these onto the generalized infrastructure established for running their jobs, and adapt them as needed. The researchers anticipate that the partner facilities would request dedicated allocations in the future to continue along this path. A second goal will be to prototype flexibility in scheduling jobs across the compute facilities. While the architecture of the machines at the compute facilities varies (e.g., CPU and GPU deployments), recent container technologies have allowed users to deploy applications and their dependencies to target these features robustly and transparently. In such a context, it is now possible to consider the notion of an application "run" in a fashion agnostic to the target facility. This portability will also be explored in the context of workloads from the team’s partner EOFs. Ultimately, this work will define the architectural and technical roadmap for EOF workflows running at DOE’s HPC facilities, expose the cross-facility policy challenges, and offer strategies to address them.

Allocations