Going Beyond Diagonal Preconditioners for Training Neural Networks At-Scale

Hao-Jun Michael Shi, Meta Platforms

Diagonal scaling-based adaptive subgradient methods such as AdaGrad and Adam(W) have dominated neural network training over the past decade due to their simplicity of implementation, linear memory and computational requirements, and robustness to hyperparameter tuning. While theoretically superior full-matrix preconditioned adaptive subgradient methods exist, their quadratic memory and cubic computational costs prevent them from being applied at scale. However, recent developments in block-diagonal Kronecker factorization-based preconditioned methods (i.e., Shampoo) that capture useful uncentered correlations within each layer have practically demonstrated faster convergence than diagonal scaling methods while remaining tractable in terms of required memory and compute. In this talk, we provide an overview of the Shampoo algorithm and detail our state-of-the-art distributed PyTorch implementation, including: (1) the heuristics required to make the algorithm work in practice; (2) the performance optimizations necessary to make Shampoo competitive in terms of per-iteration wall-clock time against diagonal scaling methods; and (3) the developments necessary to scale Shampoo to train billion- or trillion-parameter models. Experiments on ImageNet ResNet50 and the AlgoPerf benchmark demonstrate Shampoo’s potential for training neural networks more efficiently across a broad range of applications.
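To make the memory comparison in the abstract concrete, here is a minimal sketch that counts the preconditioner entries stored for a single m x n weight matrix under each approach. It is an illustration only, not the speaker's implementation: the function name `preconditioner_entries` and the example layer size are hypothetical, and the Kronecker count assumes one m x m and one n x n factor per layer, as in the standard description of Shampoo.

```python
def preconditioner_entries(m: int, n: int) -> dict[str, int]:
    """Count stored preconditioner entries for one m x n weight matrix.

    Hypothetical helper for illustration; it ignores blocking, grafting,
    and the other practical heuristics the talk covers.
    """
    d = m * n  # number of parameters in the layer
    return {
        "diagonal": d,               # one scaling entry per parameter (AdaGrad/Adam-style)
        "full_matrix": d * d,        # dense d x d preconditioner
        "kronecker": m * m + n * n,  # assumed Shampoo-style factors: m x m and n x n
    }


# Example layer size chosen arbitrarily for illustration.
for name, count in preconditioner_entries(1024, 4096).items():
    print(f"{name:>12}: {count:,} entries")
```

For this example layer, the full-matrix preconditioner would need about 1.8e13 entries, while the two Kronecker factors together need about 1.8e7, roughly four times the diagonal count.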

Bio: Hao-Jun Michael Shi is a Research Scientist in the AI and Systems Co-Design team at Meta Platforms. He obtained his B.S. degree in Applied Mathematics from the University of California, Los Angeles, and his Ph.D. from Northwestern University in Industrial Engineering and Management Sciences. He received the Walter P. Murphy Fellowship and the NSF Graduate Research Fellowship Honorable Mention in 2016 and 2017, and was a top ICML reviewer in 2019. His current research interests are in the design and implementation of scalable and distributed training algorithms and systems for deep learning. He has previously contributed to the areas of stochastic optimization, noisy optimization, and derivative-free optimization as well as recommender systems and embedding compression. 

See all upcoming talks at https://www.anl.gov/mcs/lans-seminars