Wafer Scale Computing: What it Means for AI and What it May Mean for HPC

Rob Schreiber, Cerebras Systems Inc
Cerebras’ Wafer Scale Engine silicon

Abstract

Deep learning needs more performance than CPUs provide, and the demand is growing faster than Moore's Law. Training and inference have become a major new market for the GPU. Yet while GPUs do the job better than CPUs, the GPU is not optimized for neural networks, and new, better adapted architectures are now appearing. And while AI-optimized architecture boosts performance, it will not meet the demand on its own; something else is needed. Wafer-scale computing is proving to be part of the solution.

Cerebras Systems has developed and delivered (to ANL, LLNL, GSK, and other customers) a reliable, manufacturable wafer-scale chip and system, the CS-1, aimed at training and inference. The largest chip ever made, the Cerebras Wafer-Scale Engine is 60 times larger than the largest CPU and GPU chips. It carries 400,000 compute cores that provide petaflops of performance, 18 gigabytes of fast SRAM memory with over ten petabytes per second of bandwidth, and a communication network with 50 petabits per second of bandwidth. I will present the Cerebras system, discuss the technical problems concerning yield, packaging, cooling, and delivery of electrical power that had to be solved to make it possible, and talk about the programming models now possible and in use for training.

Of course, the high performance computing market, including at ANL, is also keen to achieve better performance for simulation through the numerical solution of differential equations. We recently did an experimental evaluation, in collaboration with NETL, of the CS-1 for the model problem of solving a large sparse system of linear equations posed on a regular 3D mesh using the BiCGStab method, a typical Krylov subspace solver. On traditional systems, both memory bandwidth and communication latency limit performance and prevent strong scaling for such computations, which do not cache well and which require frequent collective communication. We achieved performance two orders of magnitude better than the best possible on a CPU cluster, because these limiting factors do not apply on the wafer-scale system. With 18 GB of on-wafer memory, there is a limit to the size of problem we can solve this way. I will discuss the future growth of the technology and the implications of the possibility of strong scaling and extreme performance for problems of modest memory footprint.
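For readers unfamiliar with the model problem, the following is a minimal sketch (not Cerebras code) of the kind of computation evaluated: a sparse linear system arising from a 7-point stencil on a regular 3D mesh, solved with BiCGStab. The mesh size and right-hand side are illustrative assumptions; only the structure of the problem matches the abstract.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

n = 64                          # mesh points per dimension (illustrative choice)
I = sp.identity(n, format="csr")
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")

# 7-point finite-difference Laplacian on an n x n x n regular mesh
A = (sp.kron(sp.kron(T, I), I)
     + sp.kron(sp.kron(I, T), I)
     + sp.kron(sp.kron(I, I), T)).tocsr()

b = np.ones(A.shape[0])         # arbitrary right-hand side

# Each BiCGStab iteration is dominated by sparse matrix-vector products and
# dot-product reductions, which are memory-bandwidth and latency bound on
# conventional clusters.
x, info = bicgstab(A, b)
print("converged" if info == 0 else f"info = {info}",
      "residual norm:", np.linalg.norm(b - A @ x))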

Speaker biography

Rob Schreiber is a Distinguished Engineer at Cerebras Systems, Inc., where he works on architecture and programming of systems for accelerated training of deep neural networks. Before Cerebras he taught at Stanford and RPI and worked at NASA, at startups, and at HP. Schreiber's research spans sequential and parallel algorithms for matrix computation, compiler optimization for parallel languages, and high performance computer design. With Moler and Gilbert, he developed the sparse matrix extension of MATLAB. He created the NAS CG parallel benchmark. He was a designer of the High Performance Fortran language. At HP, Rob led the development of a system for synthesis of custom hardware accelerators. He helped pioneer the exploitation of photonic signaling in processors and networks. He is an ACM Fellow, a SIAM Fellow, and was awarded, in 2012, the Career Prize from the SIAM Activity Group in Supercomputing.

Please use this link to attend the virtual seminar:

https://bluejeans.com/409960760