Resilience is a growing concern for large-scale simulations. As failures become more frequent, alternatives to global checkpointing that limit the extent of the required recovery become more desirable. Additionally, platforms differ in both error rates and error types, so a flexible and customizable recovery strategy can be very helpful to the applications running on them. Applications often have structures that provide logical confinement spaces that can be exploited for this purpose. We investigate a customizable recovery strategy in the context of structured adaptive mesh refinement (SAMR). We exploit the inherent granularities and hierarchy in SAMR to confine the impact of faults and enable localized recovery, and we identify tunable parameters for customizing the strategy according to application and platform behavior. As our resilience interface we use the Global View Resilience (GVR) library, which provides global versioning arrays for application-controlled state saving.
Bio:
Anshu Dubey received her Ph.D. in computer science from Old Dominion University in 1993. She then joined the University of Chicago Astronomy & Astrophysics Department as a research associate. In 2001 she joined the ASC/Flash Center, where she was associate director from 2009 to 2013. From 2013 to 2015 she was on the staff at Lawrence Berkeley National Laboratory in the Applied Numerical Algorithms Group. In 2015 she joined the Mathematics and Computer Science Division at Argonne as a computer scientist. She has two decades of experience in the development of complex scientific software and has earned wide recognition for her contributions.