Probabilistic Fault Detection and Diagnosis in Large-Scale HPC Applications

Cloud service providers charge customers based on the amount of resources used or reserved. However, there are no guarantees on the quality of service (QoS) that those resources will provide. Highly dynamic Internet workloads, the increasing complexity of cloud-hosted applications, multi-tier architectures, and the complex dynamics of the underlying shared infrastructure make it very challenging to manage application-level performance in the Cloud. The situation is further complicated by the fact that cloud service providers need to control power consumption in their data centers to avoid power capacity overload of high-density servers, lower electricity costs, and reduce their carbon footprint.

I have developed novel middleware approaches to autonomic management of virtualized resources for power and performance control in the Cloud. Firstly, I designed self-adaptive and efficient resource provisioning techniques, based on machine learning and control theory, to guarantee high-percentile performance of multi-tier web applications in the face of highly dynamic workloads. I also developed an automation tool for resource allocation and configuration of the Hadoop framework for cost-efficient big data processing in the Cloud. Secondly, I designed a system that simultaneously controls the power consumption and the performance of multi-tier applications in a virtualized server system. Furthermore, I developed a power-aware framework for managing scientific workloads in virtualized GPU computing environments, with the help of emerging technologies that provide GPU accelerators as virtualized computing resources. Lastly, I proposed and developed non-invasive, energy-efficient, and highly scalable mechanisms to achieve performance isolation of heterogeneous applications in a multi-tenant cloud system.

Argonne Leadership Computing Facility

01/28/2013, 4:30am CT