Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs

Yafan Huang, University of Iowa

As transistors continue to shrink, silent data corruptions (SDCs) are becoming a common and serious issue in HPC. Selective instruction duplication (SID) is a widely used fault-tolerance technique that can obtain high SDC coverage with low performance overhead. However, existing SID methods are confined to a single program input in their assessment, assuming that the error resilience of a program remains similar across inputs. We observe that this assumption does not always hold, leading to a drastic loss in SDC coverage on different inputs and compromising HPC reliability. We find that the SDC coverage loss correlates with a small set of instructions, which we call incubative instructions, that reveal elusive error propagation characteristics across multiple inputs. We propose MINPSID, an automated SID framework that identifies incubative instructions in programs and re-prioritizes them. Evaluation shows that MINPSID can effectively mitigate the loss of SDC coverage across multiple inputs.
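To make the idea of selecting instructions under an overhead budget concrete, the toy sketch below ranks instructions by estimated SDC benefit per unit of duplication cost, first using a single reference input and then using the worst case across inputs. It is only an illustration under stated assumptions and is not the MINPSID algorithm: the instruction names, costs, per-input SDC contributions, the greedy heuristic, and the use of the maximum across inputs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str            # hypothetical instruction label
    cost: float          # hypothetical overhead incurred if duplicated
    sdc_by_input: dict   # hypothetical SDC contribution, keyed by input name


def select(instrs, budget, benefit):
    """Greedily pick instructions by benefit/cost until the budget is spent."""
    chosen, spent = [], 0.0
    ranked = sorted(instrs, key=lambda i: benefit(i) / i.cost, reverse=True)
    for ins in ranked:
        if spent + ins.cost <= budget:
            chosen.append(ins.name)
            spent += ins.cost
    return chosen


def coverage(instrs, chosen, inp):
    """Fraction of the SDC contribution on input `inp` that is protected."""
    total = sum(i.sdc_by_input[inp] for i in instrs)
    covered = sum(i.sdc_by_input[inp] for i in instrs if i.name in chosen)
    return covered / total


# Made-up profile: i3 looks harmless on input A but dominates on input B,
# loosely mimicking what the abstract calls an incubative instruction.
INSTRS = [
    Instr("i1", 1.0, {"A": 0.5, "B": 0.1}),
    Instr("i2", 1.0, {"A": 0.3, "B": 0.1}),
    Instr("i3", 1.0, {"A": 0.05, "B": 0.6}),
]


def single_input(ins):
    # Benefit as seen from reference input A only.
    return ins.sdc_by_input["A"]


def multi_input(ins):
    # Assumed re-prioritization: worst-case benefit across all inputs.
    return max(ins.sdc_by_input.values())


if __name__ == "__main__":
    for label, fn in [("single-input", single_input), ("multi-input", multi_input)]:
        picked = select(INSTRS, budget=2.0, benefit=fn)
        print(label, picked, "coverage on B:", coverage(INSTRS, picked, "B"))
```

In this made-up profile, ranking on input A alone leaves i3 unprotected, so coverage drops on input B, while the worst-case ranking picks it up within the same budget.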


Bio: Yafan Huang is a Ph.D. student at the University of Iowa. His current research interests include resilience, such as the investigation and detection of silent data corruptions, scientific lossy compression, and performance optimization. This presentation is based on his paper recently accepted at SC22 and selected as a best paper finalist.