Performance Tuning for GPUs - An Iterative Process

Karl Rupp
Seminar

Despite all the improvements to both hardware and the software stack, programming graphics processing units (GPUs) for scientific applications remains a challenge, particularly if the resulting code must also run on other, a priori unknown GPUs. In this talk we present several levels of optimization for obtaining performance-portable GPU code: First, we start from a standard conjugate gradient formulation and minimize data transfer for better performance. In a second step, we provide tuning results for common memory-bandwidth-limited linear algebra operations across different hardware generations of the same vendor as well as across vendors. Finally, we rewrite the conjugate gradient algorithm in a mathematically equivalent way to improve data reuse.
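For reference, the following is an illustrative sketch (not taken from the talk) of the textbook conjugate gradient iteration in NumPy. Each vector operation below (matrix-vector product, dot products, axpy updates) would typically map to a separate GPU kernel, and each dot product implies a global reduction and a host-device synchronization; these are exactly the data transfers that the optimizations described above aim to reduce, e.g. by fusing kernels or by mathematically equivalent reformulations that overlap the reductions.

```python
# Illustrative textbook CG for a symmetric positive definite system A x = b.
# Not the talk's implementation; a plain NumPy sketch for orientation.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x              # initial residual
    p = r.copy()               # initial search direction
    rr = r @ r                 # dot product: one global reduction
    for _ in range(max_iter):
        Ap = A @ p                     # matrix-vector product (one kernel)
        alpha = rr / (p @ Ap)          # second reduction per iteration
        x += alpha * p                 # axpy: update solution
        r -= alpha * Ap                # axpy: update residual
        rr_new = r @ r                 # third reduction per iteration
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p      # update search direction
        rr = rr_new
    return x

# Small SPD example system
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

Mathematically equivalent rewrites of this loop (e.g. pipelined CG variants) regroup the recurrences so that the reductions can be merged or overlapped with other work; the sketch above is only the unoptimized baseline those rewrites start from.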