Large-scale distributed training of deep neural network (DNN) models exposes performance bottlenecks on high-performance computing (HPC) clusters. On the one hand, the effectiveness of DNN models depends heavily on large training datasets (e.g., terabyte-scale), which makes data loading a challenge in distributed training. On the other hand, second-order optimizers offer improved convergence and generalization in DNN training but incur extra communication overhead compared to stochastic gradient descent (SGD) optimizers. Reducing communication costs is therefore crucial for the performance of second-order optimizers.
To address these problems, I will discuss two system-level optimizations: SOLAR and SSO. SOLAR uses offline and online scheduling strategies to reduce the cost of loading data from parallel filesystems into device memory, such as graphics processing unit (GPU) memory. SSO avoids latency-bound communication and integrates lossy compression algorithms to reduce communication message size while preserving the benefits of second-order optimizers, such as faster convergence compared to SGD-based optimizers. Specifically, I will describe the challenges of data loading and communication in large-scale distributed training, share our insights on performance improvements, and explain how SOLAR and SSO address these challenges.