Characterization and Modeling of Error Resilience in HPC Applications

Luanzheng Guo, University of California, Merced

Abstract:  As supercomputers continue to grow in computational power and size, next-generation HPC systems are expected to incur a much higher failure rate than contemporary systems. Transient faults, caused by high-energy particle strikes, wear-out, and other factors, are becoming a critical contributor to in-field system failures. Transient faults can lead to Silent Data Corruption (SDC), which can affect scientific results without users realizing it. Ensuring the integrity of scientific computing in the presence of faults therefore remains one of the grand challenges for large-scale HPC systems. In this talk, the speaker will present how they understand natural error resilience in HPC applications, and how they characterize and model application resilience on data objects.

