First International Symposium on Checkpointing for Supercomputing (SuperCheck21)
NERSC is hosting the First International Symposium on Checkpointing for Supercomputing (SuperCheck21) on February 4-5, 2021. This free online event will feature the latest work in checkpoint/restart research, tools development, and production use.
- Call for Participation Release: September 24, 2020
- Abstract Submission Due: December 7, 2020 (AoE)
- Acceptance Notification: December 20, 2020
- Presentation Submission Due: January 22, 2021 (AoE)
- Symposium: February 4-5, 2021
About the Workshop
Checkpoint/Restart (C/R) is critical for fault-tolerant computing in high-performance computing (HPC). While there has been much research and development on C/R and C/R tools, few HPC end users are able to use these tools in production workloads. Although research codes often demonstrate promising C/R capabilities, there are no feasible C/R options for diverse production workloads, especially on cutting-edge HPC systems. In this workshop, we will bring together C/R researchers, practitioners, application developers, and end users to share both the latest research results and experiences on adopting C/R tools in production. The goal of this workshop is to showcase the latest research on C/R, motivate the development of usable C/R tools, and boost the adoption of C/R tools in HPC production workloads. Paper submissions will be peer-reviewed, and a venue for accepted papers will be identified. We encourage PhD students and HPC end users to participate.
The workshop scope covers all aspects of checkpointing for science and engineering in the HPC context, including the latest research results and development, deployment, and application experiences. Topics include, but are not limited to:
C/R research and tools development:
- C/R targeting the full range of supercomputing software, including MPI, OpenMP, GPGPU software, FPGAs, and cloud and container applications
- Both pure and hybrid approaches to transparent checkpointing (examples of hybrid approaches include application-specific plugins to aid in checkpointing and integrated modules for transparent checkpointing as part of larger scientific/engineering toolkits)
- Frameworks for multi-level checkpointing
- New methods for low-overhead checkpointing, new fundamental algorithms, software development methods, the impact of future supercomputer hardware, performance evaluation, reproducibility, and fault recovery
- Research on C/R scheduling and intervals
C/R use in production (including all levels of checkpointing: application, job, and system levels):
- The adoption of transparent C/R tools in production workloads (C/R use cases)
- The application-initiated use of C/R tools (alternative to built-in internal checkpointing)
- C/R applications and support on HPC systems (e.g., resource scheduling, system utilization, batch system integration, and best practices)
We encourage participation from researchers and end users, professionals and students alike.