Two areas of concern have emerged from several DOE meetings on exascale systems (machines with 100 million cores): runtime systems that can function at that scale, and fault management. The Fault Oblivious Exascale (FOX) project aims to build a software stack that addresses both issues together: a work-queue-based runtime designed to treat failure as just another event, one in which a computational component failed to complete.
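To make the idea of failure as "just another event" concrete, the following is a minimal single-process sketch of a work queue that requeues tasks whose execution did not complete. It is an illustration only, not the FOX runtime: the names `TaskFailed`, `run_work_queue`, `flaky_square` and the `max_attempts` parameter are all hypothetical, and a real system would distribute the queue across nodes.

```python
from collections import deque


class TaskFailed(Exception):
    """Hypothetical signal that a worker could not complete its task."""


def run_work_queue(tasks, execute, max_attempts=3):
    """Drain a queue of tasks, requeueing any task that fails to complete.

    A failure is handled like any other event: the task goes back on the
    queue until it succeeds or exhausts max_attempts.
    """
    queue = deque((task, 1) for task in tasks)
    results = {}
    abandoned = []
    while queue:
        task, attempt = queue.popleft()
        try:
            results[task] = execute(task, attempt)
        except TaskFailed:
            if attempt < max_attempts:
                # Put the task back at the end of the queue for another try.
                queue.append((task, attempt + 1))
            else:
                abandoned.append(task)
    return results, abandoned


def flaky_square(task, attempt):
    """Toy workload: odd-numbered tasks fail on their first attempt."""
    if task % 2 == 1 and attempt == 1:
        raise TaskFailed(task)
    return task * task


results, abandoned = run_work_queue(range(5), flaky_square)
```

In this sketch the caller never sees the transient failures of tasks 1 and 3; they are retried and complete on the second attempt. Only tasks that fail on every attempt are reported back as abandoned.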
The team is exploring fault isolation and recovery across the entire stack, from the operating system, through the runtime, up into the application. The core of this approach is a fault-tolerant distributed data store, with a task management system built on top of it. The approach will provide both file system interfaces to these system services and more tightly coupled runtime interfaces, in order to support a wide range of programming models. As no exascale systems exist, INCITE time on a petascale system will be used to test the FOX environment, specifically:
- New quantum chemistry kernel implementations using a work-queue model, developed by PNNL, SNL and OSU.
- A petascale implementation of a distributed data store based on Kyoto Cabinet, being ported to HPC platforms by LLNL and SNL.
- SESA, a new HPC OS for petascale systems, developed by BU, IBM and U. Karlsruhe.
- An asynchronous graph traversal application based on a distributed work-queue model, developed at LLNL.
This work will produce new application environments for DOE use, as well as software and library development results that the vendor can use to guide development of future exascale systems.