Toward Recovery Capabilities in Message Passing Environments

Wesley Bland
Seminar

There is a known direct relationship between the size of computing resources and their failure rate. As the scale of these platforms becomes increasingly extreme, the requirements for application fault tolerance follow the same trend. Automatic approaches have a low investment cost, but their scalability is questionable at the magnitude of future machines. More promising techniques that improve the resilience of applications' intrinsic algorithms have been developed, but they currently receive no support from the programming model, and without such support they are bound to fail. This talk discusses two approaches to failure mitigation: Checkpoint-on-Failure (CoF), which works within the current MPI Standard (Version 3.0), and User Level Failure Mitigation (ULFM), which is an extension to the MPI Standard. Experiments demonstrate the capabilities of these two techniques and show that a fault-aware MPI implementation can have little to no impact on performance for a range of applications, while producing satisfactory recovery times when failures occur.