Scientific discovery increasingly depends on complex workflows consisting of multiple phases and sometimes millions of parallelizable tasks or pipelines. These workflows access storage resources for a variety of purposes, including pre-processing, simulation output, and post-processing steps. Unfortunately, most workflow models focus on the scheduling and allocation of compute resources for tasks, while the impact on storage systems remains a secondary objective and an open research question. While some workflow engines collect telemetry information and provide it to users, the impact of I/O activity on storage performance is usually not accounted for.
In this talk, we present an approach to attribute I/O activity to the individual tasks of a workflow by combining workflow description frameworks with I/O monitoring data from Darshan/TOKIO. We introduce a conceptual architecture along with a prototype implementation for deployment in HPC data centers. We also uncover and discuss challenges that workflow management and monitoring systems for HPC will need to address in the future. Finally, a demonstration will show how real-world applications and workflows can benefit from the approach, and we give an outlook on how it can ease communication with users about performance-related advice.
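To make the idea concrete, the following is a minimal Python sketch of how per-task I/O counters from a Darshan log might be attached to a workflow task description. The Task structure and annotate_task helper are hypothetical illustrations, not the actual prototype presented in the talk; only the pydarshan DarshanReport interface is an existing API.

    from dataclasses import dataclass, field

    import darshan  # pydarshan, the Python bindings for reading Darshan logs

    @dataclass
    class Task:
        """Hypothetical stand-in for a task record from a workflow description framework."""
        name: str
        darshan_log: str               # path to the Darshan log produced by this task
        io_stats: dict = field(default_factory=dict)

    def annotate_task(task: Task) -> Task:
        """Augment a workflow task with aggregate I/O counters from its Darshan log."""
        report = darshan.DarshanReport(task.darshan_log, read_all=True)
        counters = report.records["POSIX"].to_df()["counters"]
        task.io_stats = {
            "bytes_read": int(counters["POSIX_BYTES_READ"].sum()),
            "bytes_written": int(counters["POSIX_BYTES_WRITTEN"].sum()),
        }
        return task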
Bio:
Jakob Luettgau is a researcher at the German Climate Computing Center (DKRZ) and a PhD student at the University of Hamburg, Germany, with a focus on the modeling, analysis, and architecture of I/O systems for HPC.
Within the Centre of Excellence in Simulation of Weather and Climate in Europe, he is developing middleware to store netCDF/HDF5 datasets more efficiently by distributing data and metadata across different storage services and tiers within a data center.