Fast Checkpoint for Extreme Scale Supercomputers

Leonardo Bautista Gomez
Seminar

In high-performance computing, scientific applications must make progress despite frequent failures, so long-running executions are periodically checkpointed to stable storage. Today, checkpointing to a parallel file system imposes an overhead of about 25% of execution time, and on future exascale supercomputers it will become prohibitively time-consuming. We developed a fault tolerance interface that exploits the features of large-scale hybrid systems to implement low-overhead, high-frequency multi-level checkpointing. It combines a Topology-Aware Reed-Solomon encoding algorithm with modern local storage devices, advanced clustering techniques, and Fault Tolerance Dedicated Threads. Finally, we present an exascale study based on our performance model and show that our approach can guarantee low overhead on future extreme-scale systems.