Towards Application-Attuned I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads

Avinash Maurya, Argonne National Laboratory
Seminar

To support the growing scale and complexity of modern scientific and deep-learning workloads, exascale HPC systems deploy thousands of GPUs, achieving unprecedented computational capability. These applications often exceed GPU memory capacity and must draw supplemental capacity from slower memory tiers. Multi-tier asynchronous checkpointing mitigates the I/O overheads of these slower tiers and interconnects while serving use cases with highly diverse I/O patterns: capturing and revisiting the entire history of intermediate data states, out-of-core adjoint computations, data-inconsistency analysis, provenance, reproducibility, and job preemption. Current checkpointing and data-management runtimes, however, handle large, frequently captured, distributed datasets inefficiently under concurrency at scale. This presentation introduces two specialized checkpointing runtimes that address these limitations: VeloC-GPU and DataStates-LLM. VeloC-GPU features application-guided eviction, state-machine-based data lifecycle management, accelerated memory-pool operations, and co-optimized compression and transfer schedules. DataStates-LLM leverages phases of LLM training during which model state remains immutable, combined with multi-level asynchronous checkpointing, to increase checkpointing throughput by up to 48x and reduce runtime by 2.2x. Our solutions significantly reduce I/O overheads, enabling efficient high-frequency checkpointing for a range of productive, defensive, and administrative tasks.
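
For readers unfamiliar with the technique, the sketch below illustrates the general idea of multi-level asynchronous checkpointing in PyTorch-style Python: a fast device-to-host copy into pinned buffers, followed by a background flush to persistent storage. The function name `async_checkpoint` and its structure are illustrative assumptions only and do not reflect the actual VeloC-GPU or DataStates-LLM APIs.

```python
import threading
import torch

def async_checkpoint(model, path):
    # Level 1: copy GPU tensors into pinned host buffers; non-blocking
    # device-to-host copies can overlap with compute still on the stream.
    host_state = {}
    for name, tensor in model.state_dict().items():
        pinned = torch.empty(tensor.shape, dtype=tensor.dtype,
                             device="cpu", pin_memory=True)
        pinned.copy_(tensor, non_blocking=True)
        host_state[name] = pinned
    # Wait only for the copies: once they land in host memory,
    # training may safely mutate the GPU weights again.
    torch.cuda.synchronize()

    # Level 2: flush the host snapshot to slower persistent storage
    # in a background thread, off the training critical path.
    flusher = threading.Thread(target=torch.save, args=(host_state, path))
    flusher.start()
    return flusher  # caller joins (or polls) before the next checkpoint
```

The key design point, which both runtimes discussed in the talk optimize far more aggressively, is that each tier hands data to the next one asynchronously, so the application only blocks for the fastest copy.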
