OpenMP Programming Model

The OpenMP API is an open standard for parallel programming. OpenMP is expected to provide a portable programming model available across systems with Intel, Nvidia, and AMD GPUs. For Aurora, the device offloading features of OpenMP 4.5 and beyond will provide the capability to offload kernels to the Intel GPUs.

On Aurora, the nodes will be composed of a mix of CPUs and GPUs. The CPUs are typically the “host” devices where programs begin running. However, the majority of the computational power and memory bandwidth can be accessed only from the GPUs. To run on a GPU, a program must do two things: transfer data from the host to the device, and transfer execution control (the code to run) to the device.

[Figure: OpenMP transferring data]
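
The two steps described above can be sketched in a few lines of C++. This is a minimal illustration rather than Aurora-specific code: the function name `device_sum` is hypothetical, and the sketch assumes a compiler with OpenMP 4.5 offload support. The `map` clauses carry the data to and from the device, while the `target` construct moves execution there.

```cpp
// Hypothetical helper: sums n doubles on the device and returns the result.
double device_sum(const double *a, int n) {
  double sum = 0.0;
  // map(to: ...) copies the input array from host to device.
  // map(tofrom: sum) copies the scalar in and brings the reduced value back.
  // The target construct transfers execution of the loop to the device.
  #pragma omp target teams distribute parallel for \
      map(to: a[0:n]) map(tofrom: sum) reduction(+:sum)
  for (int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}
```

If the code is built without OpenMP support, the pragma is ignored and the loop simply runs on the host, which is a convenient way to check results before offloading.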

Intel’s next-generation C/C++ and Fortran compilers, which are part of the oneAPI HPC Toolkit, will provide an optimized OpenMP implementation for offloading to Intel GPUs. OpenMP is a directive-based API: offloading execution control, transferring data, and marking regions of parallelism are all expressed with pragmas. Some of the more commonly used pragmas for offloading are shown below.

Common Pragmas for Offloading

Offloading code to run on accelerator:

  #pragma omp target [clause[[,] clause],…] structured-block
  #pragma omp declare target
  #pragma omp declare variant* (variant-func-id) clause new-line
      function definition or declaration

Distributing iterations of the loop to threads:

  #pragma omp teams [clause[[,] clause],…] structured-block
  #pragma omp distribute [clause[[,] clause],…] for-loops
  #pragma omp loop* [clause[[,] clause],…] for-loops

Controlling data transfer between devices:

  map([map-type:] list)    map-type := alloc | tofrom | from | to | …
  #pragma omp target data [clause[[,] clause],…] structured-block
  #pragma omp target update [clause[[,] clause],…]
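
As an illustration of how several of these pragmas can work together, the sketch below keeps two arrays resident on the device across two kernels with `target data`, refreshes the host copy in between with `target update`, and marks a small helper for device compilation with `declare target`. The names `axpy` and `two_step` are hypothetical and chosen only for this example; it is one possible arrangement, not a prescribed pattern.

```cpp
// Hypothetical helper, compiled for the device as well as the host.
#pragma omp declare target
inline float axpy(float a, float x, float y) { return a * x + y; }
#pragma omp end declare target

// Runs two kernels inside one data region. Returns y[0] as seen by the
// host after the first kernel, to show the effect of target update.
float two_step(float *x, float *y, int n) {
  float after_first = 0.0f;
  // x is only read on the device; y is copied in and copied back at the end.
  #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
  {
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
      y[i] = axpy(2.0f, x[i], y[i]);

    // Refresh the host copy of y without leaving the data region.
    #pragma omp target update from(y[0:n])
    after_first = y[0];

    // Second kernel reuses the device copies; no extra transfer is needed.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++)
      y[i] = axpy(2.0f, x[i], y[i]);
  }
  return after_first;
}
```

Without the enclosing `target data` region, each kernel would map its arrays on entry and exit, so the region mainly serves to avoid repeated transfers.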


Simple examples of offloading in C/C++ and Fortran are shown below. 

int main(void)
{
  int N = 1<<20; // 1M elements
  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU with OpenMP
  // (a map clause with no map-type defaults to tofrom)
  #pragma omp target teams distribute parallel for map(x[0:N],y[0:N])
  for (int i = 0; i < N; i++)
    y[i] = x[i] + y[i];

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}
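
The combined directive used above can also be split across nested loops, which may make the hierarchy of teams and threads easier to see. The following is a sketch under that assumption: the function name `add_one` and the row-major layout are hypothetical choices for illustration. The outer loop is distributed across teams, and each team's threads share the inner loop.

```cpp
// Hypothetical helper: adds 1 to every element of a rows x cols
// row-major matrix on the device.
void add_one(float *a, int rows, int cols) {
  // Outer loop: iterations are distributed across teams on the device.
  #pragma omp target teams distribute map(tofrom: a[0:rows*cols])
  for (int r = 0; r < rows; r++) {
    // Inner loop: iterations are shared among the threads of one team.
    #pragma omp parallel for
    for (int c = 0; c < cols; c++)
      a[r * cols + c] += 1.0f;
  }
}
```

Whether the split or the combined form performs better depends on the loop sizes and the compiler, so it is worth measuring both.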

An example in Fortran:

program add
      implicit none
      integer ::  i
      real    ::  x(500),y(500)

! initialize x and y arrays on the host
      do i=1,500
         x(i) = 1.0
         y(i) = 2.0
      end do

! Run kernel on arrays on the GPU with OpenMP
!$omp target teams distribute parallel do
      do i=1,500
         y(i) = x(i) + y(i)
      end do
!$omp end target teams distribute parallel do

      end program add