Performance Modeling and Optimization for Cloud Computing and HPC

Sheng Di
Seminar

In this talk, I will discuss the research I have carried out over the past four years on modeling and optimizing key issues in cloud computing and high-performance computing (HPC). In my cloud computing research, I proposed a set of optimization strategies that use resource isolation technology in a virtual machine environment. The key issues include minimizing the wall-clock time of cloud tasks, minimizing user payment for resource consumption, analyzing the worst-case execution bound, and optimizing fault tolerance. I will then discuss in more detail my research on modeling and optimizing multilevel checkpoint/restart in the context of high-performance computing. Multilevel checkpoint/restart is a promising approach that provides an elastic response to different types of failures. It stores checkpoints at different levels: for example, local memory, remote memory, a software RAID, local solid-state disk, and a remote file system. I devised a set of strategies for optimizing multilevel checkpoint intervals in order to minimize the wall-clock time of parallel HPC applications in the presence of different types of failures. The principal challenge is how to deal with the different checkpoint/restart overheads at each level and the different types of failure events, as well as their mutual impact. One key aspect of my research methodology is the combination of several complementary techniques: modeling, analytical problem solving when possible, simulation, and experiments performed on real clusters or HPC systems. I will end the talk by presenting the context of my future research plan at Argonne.