Improving Performance Portability in OpenCL Programs

Yao Zhang
Seminar

The recent development of OpenCL provides an open, portable, C-based programming model for highly parallel processors. Although the initial focus of OpenCL is functional portability, performance portability is critical to its wide adoption. In this talk, I will present our study of the performance portability of OpenCL across diverse architectures, including an NVIDIA GPU, an Intel Ivy Bridge CPU, and an AMD Fusion APU, using three exemplar benchmarks: SGEMM, SpMV, and FFT. We identify a number of tuning knobs that are critical to performance portability, including thread-data mapping, data layout, tiling size, caching, and prefetching. We further demonstrate that proper tuning could improve the portable performance of OpenCL from the current 15% to a potential 67% of state-of-the-art performance on the Ivy Bridge CPU. Finally, we evaluate the current OpenCL programming model and propose a list of extensions that would improve performance portability.