Towards a Resource Efficient Framework for Distributed Deep Learning Applications

Jingoo Han, Virginia Tech
Webinar
AI for Science report

Distributed deep learning is widely used to solve critical scientific problems involving massive datasets. However, to accelerate scientific discovery, resource efficiency is also important when deploying these workloads on real-world platforms such as high-performance computing (HPC) systems. Deploying existing deep learning applications on such distributed systems can leave HPC hardware resources underutilized, and extreme resource heterogeneity further degrades distributed training performance. Much of the prior work has not addressed the specific challenges of optimizing resource utilization in distributed deep learning on HPC systems and heterogeneous federated systems. This presentation addresses the challenges of improving the resource efficiency of distributed deep learning applications through (1) performance analysis of deep learning on supercomputers, (2) GPU-aware deep learning job scheduling, (3) topology-aware virtual GPU training, (4) heterogeneity-aware adaptive federated learning scheduling, and (5) a tokenized incentive algorithm for federated training.