Fault tolerance for Exascale parallel executions: leveraging Applications and System properties

Franck Cappello
Seminar

In 2020-2022, Exascale systems will be put into service at multiple locations around the world. Studies and projections agree that these systems will suffer more frequent failures and data corruptions than current systems. The challenge is clear: how can we make sure that Exascale application executions complete and provide correct results? Finding solutions to this problem is not trivial. In particular, scaling existing solutions will not work. In this talk we present a disruptive approach: exploring applications and systems to discover properties that could be leveraged to develop Exascale fault tolerance solutions. We will present the results of this approach in four domains: multi-level checkpoint/restart, signal-analysis-based failure prediction, domino-free fault-tolerant protocols, and silent data corruption detection based on data analytics. We will also discuss the limits of these new techniques.

BIO:
Franck Cappello is a Senior Computer Scientist at Argonne National Laboratory, where he leads the research on fault tolerance/resilience for extreme scale systems. He is the director of the Inria-Illinois-ANL-BSC-JSC joint laboratory on extreme scale computing (http://publish.illinois.edu/jointlab-esc/), which explores and develops new software addressing key challenges of extreme scale numerical simulations and data analytics. He led the resilience roadmap for the IESP (International Exascale Software Project) and EESI (European Exascale Software Initiative) efforts. He also initiated and directed several international collaborations, such as the G8 "Enabling Climate Simulation at Exascale" project. Cappello received his Ph.D. from the University of Paris XI in 1994 and joined CNRS, the French National Center for Scientific Research. In 2003, he joined INRIA, where he held the position of permanent senior researcher until 2013. In 2009, Cappello also became a visiting research professor at the University of Illinois at Urbana-Champaign. He joined Argonne National Laboratory in May 2014.