Checkpointing is the most widely used approach to providing resilience for HPC applications, since it enables restart in case of failures. However, coupled with a searchable lineage that records the evolution of intermediate data and metadata during runtime, it can become a powerful technique in a wide range of scenarios at scale: verifying and understanding results more thoroughly by sharing and analyzing intermediate results (which facilitates provenance, reproducibility, and explainability); enabling new algorithms and ideas that frequently reuse and revisit intermediate and historical data (either fully or partially); manipulating application state (job pre-emption using suspend-resume, debugging); etc.
This talk advocates a new data model and associated tools (DataStates, VELOC) that facilitate such scenarios. Rather than using a data service directly to read and write datasets, users tag datasets at runtime with properties that express hints, constraints, and persistency semantics. Doing so automatically generates a searchable record of intermediate data checkpoints, or data states, optimized for I/O. Such an approach brings new capabilities and enables high performance, scalability, and FAIR-ness through a range of transparent optimizations. The talk will introduce DataStates and VELOC, highlight several key technical details, and conclude with several examples of where they were successfully applied.
Bogdan Nicolae is a Computer Scientist with Argonne National Laboratory (Chicago, USA) and Research Professor at Illinois Institute of Technology (Chicago, USA). He specializes in scalable storage, data management, and fault tolerance for large-scale distributed systems, in particular at the intersection of high-performance computing, big data analytics, and artificial intelligence. He is interested in, and has authored numerous papers on, areas such as checkpoint-restart, state capture and migration, data and metadata decentralization and high availability, concurrency control in data management, multi-versioning and historic access, declarative data models, and live migration. He is a regular PC member and participates in the organization of major international conferences on parallel and distributed systems: SC, IPDPS, HPDC, CCGrid, CLUSTER, ICS, HIPC, ICDCS, ICPP, EuroPar, EuroMPI, etc. He is a regular reviewer for journals such as TPDS, JPDC, FGCS, PARCO, TC, TCC, and IJHPCA.