Compiler Support for Loop Optimizations in Supercomputing

Michael Kruse, Argonne National Laboratory


The computational core of many scientific-computing programs consists of various kinds of loops. Some frequently used kernels are hand-optimized by the application programmer or hardware vendor, but not all code can get this level of attention. Compilers can assist in several ways with choosing and performing loop optimizations, providing performance portability while retaining maintainability.

One approach is to support directives in the source code that tell the compiler which transformations to apply. This is relatively simple to implement in a compiler, yet immensely useful for separating performance optimizations from program semantics. Many compilers already support this in the form of unroll and vectorization pragmas. OpenMP also follows this approach for thread parallelization and target offloading. OpenMP 5.1 will add support for unrolling and tiling, with more transformations planned for OpenMP 6.0.
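
As a minimal sketch of the directive approach, the two functions below compute the same result; the second only adds a vectorization directive. The function names `saxpy_plain` and `saxpy_simd` are hypothetical and chosen for illustration. `#pragma omp simd` is standard OpenMP and typically takes effect only when OpenMP support is enabled (for example with `-fopenmp` or `-fopenmp-simd` in GCC and Clang); otherwise the compiler ignores it and the loop stays valid C.

```c
#include <stddef.h>

/* Baseline loop: y[i] += a * x[i], with no optimization hints. */
void saxpy_plain(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Same computation. The directive asks the compiler to vectorize the
 * loop; the loop body itself is unchanged, so the optimization request
 * stays separate from the program semantics. */
void saxpy_simd(size_t n, float a, const float *x, float *y) {
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Deleting the directive restores the baseline without touching the algorithm, which is the maintainability benefit this approach aims for.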

A second approach, called autotuning, uses machine learning to guide the exploration of different optimizations and automatically find the fastest variant in the search space. Since the autotuner itself cannot ensure that a transformation is correct for all inputs, the compiler has to perform a semantic legality analysis before applying each transformation.
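
The search idea behind autotuning can be sketched with a deliberately simplified stand-in: an exhaustive sweep over candidate tile sizes for a tiled matrix transpose, timed with `clock()`. This is not the machine-learning-guided search described above, and it involves no compiler legality analysis; tiling a transpose is assumed to be legal here because every output element is written exactly once. All names (`N`, `transpose_tiled`, `autotune_tile`) are hypothetical.

```c
#include <stddef.h>
#include <time.h>

#define N 256  /* matrix dimension, chosen arbitrarily for the sketch */

/* Transpose an N x N row-major matrix in tile x tile blocks.
 * tile must be greater than zero. */
void transpose_tiled(const double *in, double *out, size_t tile) {
    for (size_t ii = 0; ii < N; ii += tile)
        for (size_t jj = 0; jj < N; jj += tile)
            for (size_t i = ii; i < ii + tile && i < N; ++i)
                for (size_t j = jj; j < jj + tile && j < N; ++j)
                    out[j * N + i] = in[i * N + j];
}

/* Time every candidate tile size and return the fastest one.
 * count must be at least 1. Timings are noisy, so the winner can
 * differ from run to run and from machine to machine. */
size_t autotune_tile(const size_t *candidates, size_t count,
                     const double *in, double *out) {
    size_t best = candidates[0];
    double best_time = -1.0;
    for (size_t c = 0; c < count; ++c) {
        clock_t start = clock();
        for (int rep = 0; rep < 10; ++rep)
            transpose_tiled(in, out, candidates[c]);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = candidates[c];
        }
    }
    return best;
}
```

A real autotuner would replace the exhaustive loop with a guided search over a much larger space of transformations, and would rely on the compiler to reject any candidate that changes the program's meaning.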

The compiler can also heuristically apply transformations that are likely to improve performance. This is a hard problem because the compiler has only an incomplete performance model and only local, static program properties available. While there is no single answer for the best possible optimization of every application and hardware combination, a compiler framework that makes it easy to implement new -- possibly application-specific -- loop transformations and heuristics might eliminate the need to optimize individual performance-critical loops by hand.

Please use this link to attend the virtual seminar: