In 2021, the Argonne Leadership Computing Facility (ALCF) will deploy Aurora, a new Intel-Cray system. Aurora will be capable of over 1 exaflops. It is expected to have over 50,000 nodes and over 5 petabytes of total memory, including high bandwidth memory. The Aurora architecture will enable scientific discoveries using simulation, data and learning. The detailed architecture of Aurora is protected by a Restricted Secret Nondisclosure Agreement (RSNDA).
For the Aurora Early Science Program (ESP) Data and Learning call, we will select 10 new projects, all chosen competitively based on this call for proposals. With this CFP, we are specifically targeting applications in the areas of Data and Learning. These should have strong aspects of data science (Big Data, data-intensive computing, experimental/observational/simulation data analytics, etc.) and/or machine learning (deep learning, neural networks, discovery of patterns and reduced models for scientific data and/or simulation modeling, etc.). Cross-cutting proposals targeting the convergence of simulation, data and learning are very much encouraged.
Aurora will be a dramatically bigger and faster machine than Theta or Mira, and it will be the United States’ first exascale system. The three months of pre-production Early Science time will therefore be a large and valuable allocation of core-hours, with the potential for truly unprecedented computational science. ALCF will fully fund 10 postdoctoral appointees for Aurora ESP, one for each selected project.
Below is a rough timeline for the Aurora ESP. The rows labeled “A21 ESP projects” denote the central effort of the projects: developing, porting, and tuning code for the target system:
Before the plan to shift from a 2018-delivery (180 petaflops) Aurora based on third-generation Intel® Xeon Phi™ processors to the 2021-delivery (1 exaflops) Aurora, we had already selected 10 Aurora ESP projects. These will continue, and will serve as the simulation-based projects targeting Aurora 2021 (A21). This is reflected in the topmost, red bar in the timeline figure. An important part of the shift to A21 is the shift from primarily traditional, simulation-based computing at ALCF to an expanded scope that includes data-centric and machine/deep learning projects. We now refer to Simulation, Data, and Learning as the “three pillars” of leadership computing going forward. This call for proposals is for projects in the Data and Learning pillars.
We can provide very few details about the A21 system in this call. The speed and scale of A21 will be vastly greater than those of today’s systems or of systems on the near-term horizon. We realize that this poses a substantial challenge to proposal authors, especially in the areas of Data and Learning, where there is limited history of applications running at leadership scale, and we will take this into account when evaluating proposals. We believe that development and optimization for current large-scale computers such as ALCF’s Theta, NERSC’s Cori, OLCF’s Titan, or OLCF’s forthcoming Summit system will form a solid basis for developing and optimizing for Aurora 2021.
Once ESP projects have been selected, project teams will sign RSNDAs and learn about the Aurora architecture. They will have access to a sequence of simulators, compilers, precursor hardware testbeds, and some of the earliest-available processors in the form of white boxes. This development ecosystem, together with training and assistance from ALCF and the system vendors, should allow the teams to achieve a high level of readiness by the time the Aurora hardware arrives in 2021.
ALCF will directly support efforts on the ESP projects by hiring postdocs to work with some of the project teams. We will fund up to 10 dedicated postdocs, who will be employed by ALCF, but work directly with project investigators on efforts needed to prepare for Early Science runs. Some of these may split time between the ALCF and project-investigator locations. Generally, a postdoc will work on only one Aurora ESP project.
We will assign one ALCF staff scientist to each ESP project. This person will spend a fraction of their time collaborating with the ESP project and mentoring the postdoc on computational aspects of the project.
Another type of support for the ESP projects will be access to applications experts from the Aurora vendors—Intel and Cray. These expert consultants will be made available through various avenues to assist with ESP code porting and tuning.
ALCF and Intel/Cray will offer training targeted toward the ESP projects. This will include a virtual kick-off workshop about the Aurora hardware and programming the system, and a hands-on workshop as soon as we have sufficient Aurora-generation hardware to support testing and debugging of project applications. Before then, we will also offer access to advanced Aurora simulators and hardware as described under “Proposing to Unknown Hardware and Exascale” above, and allocations on our production system Theta for ESP development work that does not depend strongly on having the new hardware (e.g., new algorithms, new physics modules, basic introduction of threads). Where appropriate, we will offer joint training opportunities with OLCF, in support of as much portability as possible among our future systems and OLCF’s IBM/NVIDIA-based Summit system.
The Early Science period on Aurora is a span of about three months, beginning after system acceptance, but before turnover to production users. During this time, projects selected for the Aurora ESP will have dedicated access to the full system, with large allocations of time in support of their major scientific run campaigns. Access will continue for the remainder of a year, but will be shared with the production users.
Proposals for the Aurora ESP must have a plan for the science to be accomplished, and a description of what application development will occur throughout the duration of the project. In addition, each selected project’s home institution must pursue an appropriate Non-Disclosure Agreement (NDA) with Intel/Cray for access to needed information on the next-generation architecture. ALCF will provide instructions for obtaining the NDAs to the selected project teams.
The Proposal Instructions should include everything needed to submit a proposal, including a document template. The proposal format is, roughly, a simplified version of an INCITE proposal. Please direct any questions to earlyscience@alcf.anl.gov. As part of a DOE user facility collaboration on application readiness, the proposal form will ask for disclosure of participation in OLCF’s CAAR program (the equivalent of ESP at the other LCF center) and in the Exascale Computing Project.
ALCF, with the help of internal and external science-domain experts, will evaluate proposals on the strength of
We expect ESP projects to share best practices and lessons learned—the fruits of your labors developing, porting, and optimizing your applications for Aurora—in technical reports and in presentations to the community. ALCF will organize a community workshop or webinar series after the end of the Early Science period. We also expect projects to share and publish their scientific results in appropriate venues, acknowledging the ALCF and its Early Science Program.
We expect project participants to partner with ALCF and Intel/Cray to assess robustness and correctness of the new hardware and software—to help identify the root cause of problems and find potential workarounds.
Our intent is for Aurora ESP proposals to be relatively simple and short—a stripped-down version of an INCITE proposal. The sections of the proposal are
Please direct any questions to earlyscience@alcf.anl.gov. If needed, contact Tim Williams at 630-252-1154.