Getting Started on Sunspot


*** ACCESS TO SUNSPOT IS NOT CURRENTLY ENABLED ***

Table of Contents:

  1. Overview
  2. Getting Help
  3. Logging into Sunspot
  4. Home and Project Directories
  5. Scheduling
  6. Data Transfer
  7. Proxy Settings
  8. Git SSH Protocol
  9. Programming Environment Setup
  10. GPU Validation Check
  11. MPI
    1. Aurora MPICH
    2. Cray MPI
  12. Kokkos

Overview

  • The Sunspot Test and Development System (TDS) consists of 2 racks, each with 64 nodes, for a total of 128 nodes
  • Each node consists of 2x Intel Xeon CPU Max Series (codename Sapphire Rapids or SPR) and 6x Intel Data Center GPU Max Series (codename Ponte Vecchio or PVC).
    • Each Xeon has 52 physical cores supporting 2 hardware threads per core
  • Interconnect is provided via 8x HPE Slingshot-11 NICs per node.

Sunspot is a Test and Development System, and it is extremely early in the deployment of the system - do not expect a production environment!

Expect to experience:

  • Hardware instabilities – possible frequent downtimes
  • Software instabilities – non-optimized compilers, libraries, and tools; frequent software updates
  • Non-final configurations (e.g. storage, OS versions, etc.)
  • Short notice for downtimes (scheduled downtimes will be announced with 4 hr notice, but sometimes downtimes may occur with just an email notice). Notices go to the sunspot-notify@alcf.anl.gov email list. All users with access are added to the list initially.

Getting Help:

  • Email ALCF Support: support@alcf.anl.gov for bugs, technical questions, software requests, reservations, priority boosts, etc.
    • ALCF’s user support team will triage and forward the tickets to the appropriate technical SME as needed
    • Expect turnaround times to be slower than on a production system as the technical team will be focused on stabilizing and debugging the system
  • For faster assistance, consider contacting your project’s POC at ALCF (project catalyst or liaison)
    • They are an excellent source of assistance during this early period and will be aware of common bugs and known issues
  • ECP and ESP users will be added to a CNDA Slack workspace, where CNDA discussions may occur. An invite to the Slack workspace will be sent when a user is added to the Sunspot resource.

Logging into Sunspot user access nodes

You can access the system by SSH'ing to 'bastion.alcf.anl.gov'. The bastion is merely a pass-through erected for security purposes and is not meant to host files. Once on the bastion, SSH to 'sunspot.alcf.anl.gov', which round-robins to the UANs (user access nodes).
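For example (a minimal sketch; replace "username" with your ALCF username):

ssh username@bastion.alcf.anl.gov
ssh sunspot.alcf.anl.gov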

Home and project directories

  1. Home directories are mounted as /home and are shared across the UANs and compute nodes. The bastions have a different /home, which is on Swift (shared with Polaris, Theta, and Cooley). The default quota is 50 GB.

  2. Project directories are on /lus/gila/projects. ALCF staff should use the /lus/gila/projects/Aurora_deployment project directory. ESP and ECP project members should use their corresponding project directories. The default quota is 1 TB.

Home and Project directories are on a Lustre file system called Gila.

Scheduling

Sunspot uses PBSPro for job scheduling. For more information on using PBSPro, see PBSPro at ALCF. There are two production execution queues on Sunspot, "workq" and "diags". The diags queue is a lower-priority queue that runs jobs only when there are no jobs queued in workq.


For example, a one-node interactive job can be requested for 30 min with:

qsub -l select=1 -l walltime=30:00 -A Aurora_deployment -q workq -I

Queue Policies:

For workq:

  1. max job length: 2 hr 
  2. interactive jobs have a shell timeout of 30 mins
  3. max number of jobs queued: 1 running and 1 queued

There are no restrictions for the diags queue; as noted above, it is a lower-priority queue that runs only when there are no jobs queued in workq.
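For example, a multi-node batch job could be submitted to diags with something like the following (a sketch; the project name and job script are placeholders):

qsub -l select=2 -l walltime=60:00 -A MyProjectAllocationName -q diags ./jobscript.pbs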

Data Transfer

Currently, scp is the only way to transfer data to/from Sunspot.
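For example, a file can be copied from a local machine to Sunspot in one step by jumping through the bastion (a sketch; it assumes a reasonably recent OpenSSH client, and the username and paths are placeholders):

scp -o ProxyJump=username@bastion.alcf.anl.gov ./myfile username@sunspot.alcf.anl.gov:/home/username/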

Proxy Settings

export HTTP_PROXY=http://proxy.alcf.anl.gov:3128

export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128

export http_proxy=http://proxy.alcf.anl.gov:3128

export https_proxy=http://proxy.alcf.anl.gov:3128

git config --global http.proxy http://proxy.alcf.anl.gov:3128
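To check that the proxy settings are working from a UAN, a quick test such as the following should print an HTTP status line (any external URL will do):

uan-0001:~$ curl -sI https://github.com | head -1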

Git with SSH protocol

The default SSH port 22 is blocked on Sunspot; by default, this prevents communication with Git remotes that use SSH URLs such as:

git clone [user@]server:project.git

For a workaround for GitLab and GitHub, edit ~/.ssh/config to include:

Host github.com
     User git
     hostname ssh.github.com

Host gitlab.com
     User git
     hostname altssh.gitlab.com

Host github.com gitlab.com
     Port 443
     ProxyCommand /usr/bin/socat - PROXY:proxy.alcf.anl.gov:%h:%p,proxyport=3128

Your proxy environment variables must be set as described above.
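With the SSH config and proxy variables in place, SSH-style Git remotes should work as usual, for example (the repository path is a placeholder):

git clone git@github.com:someorg/somerepo.git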

Programming Environment Setup

Loading Intel OneAPI SDK + Aurora optimized MPICH

The modules are located in /soft/modulefiles and are set up by default in the user path. If you do a module list and don't see the oneapi module loaded, you can reset to the default by following the instructions below:

uan-0001:~$ module purge

uan-0001:~$ module restore

The Cray PE modules for GNU compilers, PALS, etc., are located in /opt/cray/pe/lmod/modulefiles/core. This module path should already be set in your user environment.

 

If you would like to explicitly load the fabric/network stack after modifying the default SDK/UMD, load append-deps/default at the end:

 

uan-0001:~$ module load append-deps/default

Note that the Cray PALS modulefile should be loaded last, as it is important that the correct mpiexec from PALS is present as the default. This can be confirmed with the type -a command as below:

uan-0001:~$ type -a mpiexec

 

mpiexec is /opt/cray/pe/pals/1.2.4/bin/mpiexec

You can also use other modules (cmake, thapi/iprof) thanks to spack:

 

uan-0001:~$ module load spack 

uan-0001:~$ module load cmake thapi
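For example, iprof from the thapi module can be prefixed to your application when launching it to collect a profile (a minimal sketch; the binary name is a placeholder):

uan-0001:~$ mpiexec -np 12 -ppn 12 iprof ./myBinaryName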

GPU Validation Check

In some cases a workload might hang on the GPU. In such situations, you can use the included gpu_check script (FLR in JLSE), which is set up when you load the runtime, to verify that all GPUs are okay, kill any hung or running workloads on the GPUs, and, if necessary, reset the GPUs as well.

x1922c6s6b0n0:~$ gpu_check -rq 

Checking 6 GPUs  . . . . . . .

All 6 GPUs are okay!!!

MPI

There are various ways to use MPI on Sunspot.

Aurora MPICH

Aurora MPICH will be the primary MPI implementation on Aurora. It is jointly developed by Intel and Argonne and supports GPU-aware communication.

You should have access to it with the default oneAPI module loaded. 

Use the associated compiler wrappers mpicxx, mpifort, mpicc, etc., as opposed to the Cray wrappers CC, ftn, cc. As always, the MPI compiler wrappers automatically link in MPI libraries when you use them to link your application.
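For example, an MPI+SYCL source file might be compiled and linked with (a sketch; the file and binary names are placeholders):

uan-0001:~$ mpicxx -fsycl -fsycl-targets=spir64 -o myBinaryName mycode.cpp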

Use mpiexec to invoke your binary, or a wrapper script around your binary. You will generally need to use a wrapper script to control how MPI ranks are placed within and among GPUs. Variables set by the HPE PMIx system provide hooks to things like node counts and rank counts.

The following job script illustrates this; the wrapper script it invokes (gpu_tile_compact.sh) is described after it:

Example job script: jobscript.pbs

#!/bin/bash
#PBS -l select=32:system=sunspot,place=scatter
#PBS -A MyProjectAllocationName
#PBS -l walltime=01:00:00
#PBS -N 32NodeRunExample
#PBS -k doe

export TZ='/usr/share/zoneinfo/US/Central'
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=8
unset OMP_PLACES

cd /path/to/my/run/directory

echo Jobid: $PBS_JOBID
echo Running on host `hostname`
echo Running on nodes `cat $PBS_NODEFILE`

NNODES=`wc -l < $PBS_NODEFILE`
NRANKS=12                 # Number of MPI ranks per node
NDEPTH=16                 # Number of hardware threads per rank; spacing between MPI ranks on a node
NTHREADS=$OMP_NUM_THREADS # Number of OMP threads per rank, given to OMP_NUM_THREADS

NTOTRANKS=$(( NNODES * NRANKS ))

echo "NUM_NODES=${NNODES}  TOTAL_RANKS=${NTOTRANKS}  RANKS_PER_NODE=${NRANKS}  THREADS_PER_RANK=${OMP_NUM_THREADS}"
echo "OMP_PROC_BIND=$OMP_PROC_BIND OMP_PLACES=$OMP_PLACES"

mpiexec -np ${NTOTRANKS} -ppn ${NRANKS} -d ${NDEPTH} --cpu-bind depth -envall /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh ./myBinaryName

The gpu_tile_compact.sh script, located at /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh, should be in your path. It round-robins GPU tiles among ranks.
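For illustration only, a greatly simplified wrapper of this kind might look like the sketch below; it assumes the PALS_LOCAL_RANKID variable provided by the PALS launcher and binds each local rank to a whole GPU rather than to individual tiles (the real gpu_tile_compact.sh does more than this):

#!/bin/bash
# Simplified sketch (not the actual gpu_tile_compact.sh):
# map each local MPI rank to one of the 6 GPUs on the node.
num_gpus=6
gpu_id=$(( PALS_LOCAL_RANKID % num_gpus ))
export ZE_AFFINITY_MASK=$gpu_id
exec "$@"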

 

The example job script includes everything needed except the queue name, which will default accordingly. Submit it using qsub:

qsub jobscript.pbs

Cray MPI (WIP)

Cray MPI is the MPI provided by HPE and is a derivative of MPICH. It is optimized for Slingshot but provides no integration with Intel GPUs.

This setup is for CrayPE 22.10.

Check CPE Version

> ls -l /opt/cray/pe/cpe

total 0

drwxr-xr-x 2 root root 264 Jun  1 21:56 22.10

lrwxrwxrwx 1 root root   5 Jun  1 21:41 default -> 22.10

Building on UAN

Configure the modules to bring in support for CPE and the expected PALS environment.

UAN Build

#If still using oneapi SDK

> module unload mpich

#Purge env if you want to use Cray PE GNU compilers

#module purge

> module load craype PrgEnv-gnu cray-pmi cray-pmi-lib craype-network-ofi craype-x86-spr craype/2.7.17 cray-pals/1.2.4 cray-libpals/1.2.4 cray-mpich

You can use the Cray HPE wrappers to compile MPI code that is CPU-only.

CPU-only compile/link

 

> cc -o test test.c

> ldd test | grep mpi

    libmpi_gnu_91.so.12 => /opt/cray/pe/lib64/libmpi_gnu_91.so.12 (0x00007ff2f3329000)

Code that utilizes GPU offload should be built with the Intel compiler suite; otherwise, linking with cc could result in SPIR-V code being stripped from the binary.

Add the MPI compiler and linker flags to your Makefile and use the Intel compiler of your choice.

Makefile

CXX=icpx
CMPIFLAGS=-I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include
CXXOMPFLAGS=-fiopenmp -fopenmp-targets=spir64
CXXSYCLFLAGS=-fsycl -fsycl-targets=spir64
CMPILIBFLAGS=-D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2

TARGETS=mpi-omp mpi-sycl

all: $(TARGETS)

mpi-omp.o: mpi-omp.cpp
	$(CXX) -c $(CXXOMPFLAGS) $(CMPIFLAGS) $^

mpi-sycl.o: mpi-sycl.cpp
	$(CXX) -c $(CXXSYCLFLAGS) $(CMPIFLAGS) $^

mpi-omp: mpi-omp.o
	$(CXX) -o $@ $^ $(CXXOMPFLAGS) $(CMPILIBFLAGS)

mpi-sycl: mpi-sycl.o
	$(CXX) -o $@ $^ $(CXXSYCLFLAGS) $(CMPILIBFLAGS)

clean::
	rm -f *.o $(TARGETS)

Expected output

Build Output

> make

icpx -c -fiopenmp -fopenmp-targets=spir64 -I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include  mpi-omp.cpp
icpx -o mpi-omp mpi-omp.o -fiopenmp -fopenmp-targets=spir64 -D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2
icpx -c -fsycl -fsycl-targets=spir64 -I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include  mpi-sycl.cpp
icpx -o mpi-sycl mpi-sycl.o -fsycl -fsycl-targets=spir64 -D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2

Running on Compute Nodes

The job script must load the appropriate modules. It must also set the library path so the correct libpals is found, as an older version gets picked up by default regardless of module selection (an example is given after the script below).

run.sh

#!/bin/bash
#PBS -A Aurora_deployment
#PBS -q workq
#PBS -l select=1
#PBS -l walltime=10:00
#PBS -l filesystems=home

rpn=6
nodes=$(wc -l < $PBS_NODEFILE)   # number of nodes allocated to the job
ranks=$((nodes * rpn))

#If still using oneapi SDK
module unload mpich
#Purge env if you want to use Cray PE GNU compilers
#module purge
module load craype PrgEnv-gnu cray-pmi cray-pmi-lib craype-network-ofi craype-x86-spr craype/2.7.17 cray-pals/1.2.4 cray-libpals/1.2.4 cray-mpich
module list

cd $PBS_O_WORKDIR

mpiexec -n $ranks -ppn $rpn ./mpi-omp
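If an older libpals is still being picked up at run time, the library path can be pointed at the matching version before the mpiexec line; a sketch, assuming the cray-libpals 1.2.4 installation lives under /opt/cray/pe/pals/1.2.4:

export LD_LIBRARY_PATH=/opt/cray/pe/pals/1.2.4/lib:$LD_LIBRARY_PATH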

Submit the job from the UAN

Job submission

> qsub ./run.sh

1123.amn-0001

Output from the test cases

OMP Output

> mpiexec -n 6 -ppn 6 ./mpi-omp

hi from device 2 and rank 2

hi from device 0 and rank 0

hi from device 3 and rank 3

hi from device 4 and rank 4

hi from device 1 and rank 1

hi from device 5 and rank 5

SYCL Output

> mpiexec -n 6 -ppn 6 ./mpi-sycl

World size: 6

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 4 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 3 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 0 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 1 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 2 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 5 !

The programs used to generate these outputs are mpi-omp.cpp and mpi-sycl.cpp.

 

mpi-omp.cpp

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char** argv) {

  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

#pragma omp target device( world_rank % omp_get_num_devices())
  {
    printf( "hi from device %d and rank %d\n", omp_get_device_num(), world_rank );
  }

  // Finalize the MPI environment.
  MPI_Finalize();
}

mpi-sycl.cpp

#include <mpi.h>
#include <sycl/sycl.hpp>
#include <stdio.h>
#include <stdlib.h>   // for putenv
#include <string.h>

int main(int argc, char** argv) {

  // Initialize the MPI environment
  MPI_Init(NULL, NULL);

  // Get the rank of the process
  int world_rank;
  int world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  char zemask[256];
  snprintf(zemask, sizeof(zemask), "ZE_AFFINITY_MASK=%d", world_rank % 6);
  putenv(zemask);

  if (world_rank == 0) std::cout << "World size: " << world_size << std::endl;

  sycl::queue Q(sycl::gpu_selector{});

  std::cout << "Running on "
            << Q.get_device().get_info<sycl::info::device::name>()
            << "\n";

  Q.submit([&](sycl::handler &cgh) {
    // Create an output stream
    sycl::stream sout(1024, 256, cgh);
    // Submit a unique task, using a lambda
    cgh.single_task([=]() {
      sout << "Hello, World from "  << world_rank << " ! " << sycl::endl;
    }); // End of the kernel function
  });   // End of the queue commands. The kernel is now submitted
  Q.wait();

  // Finalize the MPI environment.
  MPI_Finalize();
}

Kokkos

There is one central build of Kokkos in place now, with {Serial, OpenMP, SYCL} execution spaces and ahead-of-time (AoT) compilation for PVC.

module use /soft/modulefiles

module load kokkos

will load it. If you're using cmake to build your Kokkos app, it's the usual drill. Otherwise, loading this module will set the KOKKOS_HOME environment variable, which you can use in Makefiles etc. to find include files and libraries.
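For example, a CMake-based Kokkos application might be configured against this installation roughly as follows (a sketch; it assumes your CMakeLists.txt calls find_package(Kokkos) and that icpx is the desired compiler):

module use /soft/modulefiles
module load kokkos
cmake -S . -B build -DCMAKE_CXX_COMPILER=icpx -DCMAKE_PREFIX_PATH=${KOKKOS_HOME}
cmake --build build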