HPC platforms and applications are becoming increasingly complex. Consequently, protecting results against all forms of corruption and ensuring their trustworthiness are becoming more important. While previous work focuses on application-specific detectors, the dataflow manager in our current work on the Decaf project aims to provide an efficient, generic mechanism. We address these issues using new replication patterns that rely on an auxiliary method and an external learning observer. In this talk, we present both the theoretical validation mechanisms and different use cases where our mechanism can be applied to detect silent data corruption.
Bio:
Hadrien Croubois is a fourth-year student at the École Normale Supérieure de Lyon, France. After completing his master's degree in image analysis and processing last year (2014), he spent a year working on the Decaf project at Argonne National Laboratory. He was recently awarded Ph.D. and teaching assistant funding at the Laboratoire de l'Informatique du Parallélisme (LIP), Lyon, France, where he will study autonomous work and data placement on grid infrastructures.