Major exascale reports indicate that future HPC systems will suffer shorter Mean Time Between Failures (MTBF) due to growing system complexity and the continued shrinking of hardware components. On such unreliable computing systems, it is reasonable for application users to explicitly manage the response to frequent system failures. Traditionally, checkpoint-restart (CR) has been a popular resilience technique for application users, but it incurs undue cost associated with access to secondary storage (distributed I/O) and the global restart of parallel programs. Interestingly, anecdotal evidence suggests that the majority of large-scale HPC application failures are attributable to the failure of a single node. If this holds, traditional CR consumes more system resources than necessary, because it responds to an application failure of any scale with the same global recovery. This motivates a new approach that adapts recovery to the scale of the failure.
We have proposed the Local Failure Local Recovery (LFLR) concept, which allows parallel applications to recover locally from single-node (local) failures without global program termination and restart. In a joint effort with Rutgers University, we have developed a software prototype, FENIX, to realize scalable online application recovery using MPI-ULFM (a fault-tolerant MPI prototype). In this talk, we will discuss the architecture of FENIX and its capabilities.
Bio:
Dr. Keita Teranishi is a principal member of the technical staff in the Scalable Modeling and Analysis Systems Department at Sandia National Laboratories in Livermore, CA. He has been involved in research on HPC application resilience and asynchronous many-task parallel programming models for extreme-scale systems.
Dr. Teranishi holds a Ph.D. in computer science from The Pennsylvania State University, where he did graduate work on parallel sparse linear system solvers and their applications. Prior to working at Sandia, he was a software engineer at Cray Inc., where he supported a number of scientific libraries including ScaLAPACK, PETSc and Trilinos. He was also the main developer of automatically performance-tuned dense and sparse matrix kernels for Cray MPP families and of dense matrix libraries for GPU-based systems.