*** ACCESS TO SUNSPOT IS ENABLED FOR ESP AND ECP TEAMS ONLY ***
Overview
- The Sunspot Test and Development System (TDS) consists of 2 racks, each with 64 nodes, for a total of 128 nodes
- Each node consists of 2x Intel Xeon CPU Max Series (codename Sapphire Rapids or SPR) and 6x Intel Data Center GPU Max Series (codename Ponte Vecchio or PVC).
- Each Xeon has 52 physical cores supporting 2 hardware threads per core
- Interconnect is provided via 8x HPE Slingshot-11 NICs per node.
Sharing of any results from Sunspot publicly no longer requires review or approval from Intel. However, anyone publishing these results should include the following in their materials: "This work was done on a pre-production supercomputer with early versions of the Aurora software development kit." In addition, users should acknowledge the ALCF. Refer to the acknowledgement policy page for details: https://docs.alcf.anl.gov/policies/alcf-acknowledgement-policy/#alcf-only-acknowledgement. Please note that certain information on Sunspot hardware and software is considered NDA and cannot be shared publicly.
Sunspot is a Test and Development System, and it is extremely early in the deployment of the system - do not expect a production environment!
Expect to experience:
- Hardware instabilities – possible frequent downtimes
- Software instabilities – non-optimized compilers, libraries, and tools; frequent software updates
- Non-final configurations (e.g. storage, OS versions, etc.)
- Short notice for downtimes (scheduled downtimes will be with 4 hr notice, but sometimes downtimes may occur with just an email notice). Notices go to the sunspot-notify@alcf.anl.gov email list. All users with access are added to the list initially.
Prerequisites for Access to Sunspot/Aurora
*** ACCESS TO SUNSPOT (and AURORA) IS ENABLED FOR ESP AND ECP TEAMS ONLY ***
ECP:
Exascale Computing Project (ECP) team members must:
Request Aurora early hardware/software access through ECP by filling out the Jira* form: https://jira.exascaleproject.org/servicedesk/customer/portal/10/create/254. If you have already submitted a request, it was not rejected, and you have not changed institutions, please skip this step; you do not need to submit a second request. Note that access to the ECP Atlassian/Jira tool ends for users on December 31, 2023. After December 31st, the ECP project office will no longer accept Sunspot account requests. All requests must be submitted before December 31, 2023.
If you don’t have an ECP Atlassian/Jira account, follow the steps below. Questions regarding ECP Jira account or access should be emailed to ecp-support@exascaleproject.org. Proceed to step 2 once you have submitted the ECP Jira form.
- Ask your PI or their representative to complete the onboarding form https://jira.exascaleproject.org/servicedesk/customer/portal/20/create/189 and be sure they select “Jira Project” in the tools access list (optionally, also select “Confluence”).
- Once submitted, notifications are sent to initiate the ECP Atlassian account creation process. PI approval and PAS (personnel access system) approval must be completed before the account is created. PAS processing for foreign nationals can take 7-10 days or more after receipt of required materials.
- Requestor will be notified when the ECP Atlassian account is created.
- Please read and acknowledge the latest Terms of Use by filling out the form below. You are responsible for ensuring you are authorized by your institution to read and acknowledge the TOU: https://events.cels.anl.gov/event/147/surveys/7.
- Have an active ALCF account and be a member of all the appropriate ECP projects on Polaris.
- Request an account if you do not have one: https://accounts.alcf.anl.gov/#/accountRequest. Search for your project(s) by the WBS number (for ECP) or name with the right PI. Do not choose projects ending in _CNDA.
- Reactivate your account if it is inactive: https://accounts.alcf.anl.gov/#/accountReactivate. Search for your project by the WBS number (for ECP) or name with the right PI. Do not choose projects ending in _CNDA.
- If you have an active account but you are not on all the ESP/ECP projects on Theta/Polaris, request to join the projects that are missing: https://accounts.alcf.anl.gov/#/joinProject. Search for your project by the WBS number (for ECP) or name with the right PI. Do not choose projects ending in _CNDA.
Team members who satisfy all the prerequisites listed above should then email support@alcf.anl.gov requesting access to Sunspot/Aurora.
ESP:
Refer to this page for instructions: https://docs.alcf.anl.gov/aurora/getting-started-on-aurora/#for-aurora-early-science-program-esp-team-members
Getting Help:
- Email ALCF Support at support@alcf.anl.gov for bugs, technical questions, software requests, reservations, priority boosts, etc.
- ALCF’s user support team will triage and forward the tickets to the appropriate technical SME as needed
- Expect turnaround times to be slower than on a production system as the technical team will be focused on stabilizing and debugging the system
- For faster assistance, consider contacting your project’s POC at ALCF (project catalyst or liaison)
- They are an excellent source of assistance during this early period and will be aware of common bugs and known issues
- ECP and ESP users will be added to a CNDA Slack workspace, where CNDA discussions may occur. An invite to the Slack workspace will be sent when a user is added to the Sunspot resource.
Known Issues
A known issues page can be found in the JLSE Wiki space used for NDA content. Note that this page requires a JLSE Aurora early hw/sw resource account for access: https://wiki.jlse.anl.gov/display/inteldga/Known+Issues
Logging into Sunspot user access nodes
You can access the system by SSHing to 'bastion.alcf.anl.gov'. The bastion is merely a pass-through erected for security purposes and is not meant to host files. Once on the bastion, SSH to 'sunspot.alcf.anl.gov', which resolves via round robin to the UANs (user access nodes). To use ProxyJump, see the Data Transfer section below.
Note that Sunspot uses ALCF credentials (same as Polaris and https://accounts.alcf.anl.gov ) and not JLSE credentials.
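For example, a typical login sequence from your local machine looks like this (replace <your_ALCF_username> with your ALCF username and authenticate with your ALCF credentials at each prompt):
ssh <your_ALCF_username>@bastion.alcf.anl.gov
# then, from the bastion prompt:
ssh sunspot.alcf.anl.gov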
Home and project directories
- Home directories are mounted as /home and shared on the UANs and compute nodes. The bastions have a different /home, which is on Swift (shared with Polaris, Theta, and Cooley). The default quota is 50 GB.
- Project directories are on /lus/gila/projects
- ALCF staff should use the /lus/gila/projects/Aurora_deployment project directory. ESP and ECP project members should use their corresponding project directories. The project name is similar to the name on Theta/Polaris with an _CNDA suffix (e.g., projectA_aesp_CNDA, CSC250ADABC_CNDA). The default quota is 1 TB. The project PI should email support@alcf.anl.gov if their project requires additional storage.
Home and Project directories are on a Lustre file system called Gila.
Quotas
Default home quota is 50 GB. Use this command to view your home directory quota usage:
/soft/tools/alcf_quota/bin/myquota
Default quota for the project directories is 1 TB. The project PI should email support@alcf.anl.gov if their project requires additional storage. Use this command to check your project quota usage:
/soft/tools/alcf_quota/bin/myprojectquotas
Note that due to high utilization of the filesystem (Gila), we are enforcing user quotas of 50GB per user on Gila. This means the combination of file sizes across all directories on Gila owned by a user is capped at 50GB. Use the following command to check your Gila user quota (replace <username> with your ALCF username):
lfs quota -h -u <username> /lus/gila
Scheduling
Sunspot has PBSPro. For more information on using PBSPro for job scheduling, see PBSPro at ALCF.
There are two production execution queues "workq" and "diag" and one debug queue called "debug" on Sunspot. In addition, there is a routing queue called "workq-route", that can be used to hold multiple jobs which get routed to workq. Note that users can submit jobs to workq directly and do not have to use the routing queue (workq-route) if they don't need to.
- The diag queue is a lower-priority queue, intended for operational diagnostics, that runs jobs only when there are no jobs queued in workq. Access to the diag queue is restricted. Email support@alcf.anl.gov if you have a need to use this queue and provide a write-up of your use case.
For example, a one-node interactive job on workq can be requested for 30 minutes with:
qsub -l select=1 -l walltime=30:00 -A Aurora_deployment -q workq -I
Queue Policies:
For workq queue:
- max job length: 2 hr
- max job size : 128 - (nodes that are down) - (nodes that have broken/validation flags set on them [currently 4]) - (4 debug nodes)
- interactive jobs have a 30-minute shell timeout; interactive shells that are idle for more than 30 minutes will exit
- max number of jobs: 1 running and 1 queued
For workq-route queue (routing queue):
- max job length: 2 hr
- max job size : 128 - (nodes that are down) - (nodes that have broken/validation flags set on them [currently 4]) - (4 debug nodes)
- max number of jobs queued: 30
For diag queue:
There are no job size or length limits for the diag queue. It is a lower-priority queue, intended for operational diagnostics, that runs jobs when there are no jobs queued in the workq or debug queues. Access to the diag queue is restricted. Email support@alcf.anl.gov if you have a need to use this queue and provide a write-up of your use case.
For debug queue:
- max job length: 1 hr
- max job size : 1 node (a total of 4 nodes are reserved for the debug queue)
- interactive jobs have a 30-minute shell timeout; interactive shells that are idle for more than 30 minutes will exit
- max number of jobs: 1 running and 1 queued
Submission Options:
For jobs running in the production queues, the following default settings will be applied unless otherwise changed by the user:
hbm_mode=flat
numa_mode=quad
The following is an example of how to specify flat mode:
-l select=16:ncpus=208:hbm_mode=flat
To submit a full job in flat mode when you have multiple chunks, your select statement will need to be along the lines of the following example so that the mode is applied to each chunk:
-l select=1:vnode=x1921c0s0b0n0:hbm_mode=flat+1:vnode=x1921c1s0b0n0:hbm_mode=flat
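For instance, an interactive job requesting flat memory mode on each chunk could be submitted as follows (the node count, walltime, and project are illustrative; adjust them for your case):
qsub -l select=2:ncpus=208:hbm_mode=flat -l walltime=30:00 -A Aurora_deployment -q workq -I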
Allocation usage
The allocation accounting system, sbank, is installed on Sunspot.
- To obtain usage information for all your projects, run the sbank command sbank-list-allocations on Sunspot.
For more information, see this page: https://docs.alcf.anl.gov/account-project-management/allocation-management/allocation-management/
Data Transfer
Currently, scp and SFTP are the only ways to transfer data to/from Sunspot.
As an expedient for initiating ssh sessions to Sunspot login nodes via the bastion pass-through nodes, and to enable scp from remote hosts to Sunspot login nodes, follow these steps:
- Create SSH keys on the laptop/desktop/remote machine. See "Creating SSH Keys" section on this page.
- Add the lines listed below to your ~/.ssh/config file on the remote host. That is, do this on your laptop/desktop, from which you initiate ssh login sessions to Sunspot via bastion, and on any other non-ALCF host from which you want to copy files to Sunspot login nodes using scp. Replace id_rsa with the name of your own private ssh key file. Copy the public key (*.pub) from the ~/.ssh folder on the remote machine to the ~/.ssh/authorized_keys file on Sunspot (not the bastion).
When you use an SSH proxy, it takes the authentication mechanism from the local host and applies it to the farthest-remote host, while prompting you for the “middle host” separately. So, when you run ssh sunspot.alcf.anl.gov on your laptop/desktop, you'll be prompted for two ALCF authentication codes - first the Mobilepass+ or Cryptocard passcode for the bastion, and then the SSH passphrase for Sunspot. Likewise, when you run scp from a remote host to copy files to Sunspot login nodes, you'll be prompted for two ALCF authentication codes - first the Mobilepass+ or Cryptocard passcode and then the SSH passphrase.
File: ~/.ssh/config
Host bastion.alcf.anl.gov
    User <your_ALCF_username>

Host *.sunspot.alcf.anl.gov sunspot.alcf.anl.gov
    ProxyJump bastion.alcf.anl.gov
    DynamicForward 3142
    IdentityFile ~/.ssh/id_rsa
    User <your_ALCF_username>
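With the ProxyJump entry above in place, scp from your local machine can target the Sunspot login nodes directly. For example (the file name and destination path are illustrative):
scp mydata.tar.gz <your_ALCF_username>@sunspot.alcf.anl.gov:/lus/gila/projects/<your_project>/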
Proxy Settings
export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
git config --global http.proxy http://proxy.alcf.anl.gov:3128
Git with SSH protocol
The default SSH port 22 is blocked on Sunspot; by default, this prevents communication with Git remotes that are SSH URLs such as:
git clone [user@]server:project.git
For a workaround for GitLab, GitHub, and Bitbucket, edit ~/.ssh/config to include:
Host github.com
    User git
    hostname ssh.github.com
Host gitlab.com
    User git
    hostname altssh.gitlab.com
Host bitbucket.org
    User git
    hostname altssh.bitbucket.org
Host github.com gitlab.com bitbucket.org
    Port 443
    ProxyCommand /usr/bin/socat - PROXY:proxy.alcf.anl.gov:%h:%p,proxyport=3128
Your environment variable Proxy Settings must be set as above.
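With the config and proxy settings in place, you can sanity-check connectivity to a Git host. This is the standard GitHub test command (not Sunspot-specific); a successful reply greets you by username and then closes the connection:
ssh -T git@github.com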
Using Non-Default SSH Key for GitHub
If you need to use something besides your default SSH key on Sunspot for authentication to GitHub in conjunction with the SSH workaround, you may set
export GIT_SSH_COMMAND="ssh -i ~/.ssh/specialGitKey -F /dev/null"
where specialGitKey is the name of the private key in your .ssh subdirectory for which you have uploaded the public key to GitHub.
Programming Environment Setup
Loading Intel OneAPI SDK + Aurora optimized MPICH
The modules are located in /soft/modulefiles and are set up in the user path by default. The default set of modules is deliberately kept to a minimum on Sunspot.
If you run module list and don't see the oneapi module loaded, you can reset to the default set by following the instructions below:
uan-0001:~$ module purge
uan-0001:~$ module restore
The Cray PE (GNU compilers, PALS, etc.) is located in /opt/cray/pe/lmod/modulefiles/core.
Add the following ahead of any module load commands:
module use /opt/cray/pe/lmod/modulefiles/core /opt/cray/pe/lmod/modulefiles/craype-targets/default
If you would like to explicitly load the fabric/network stack after modifying the default SDK/UMD, load append-deps/default at the end:
uan-0001:~$ module load append-deps/default
Note: the Cray PALS modulefile should be loaded last, as it is important that the correct mpiexec from PALS is the default mpiexec. This can be confirmed with the type -a command as below:
uan-0001:~$ type -a mpiexec
mpiexec is /opt/cray/pe/pals/1.2.4/bin/mpiexec
You can also use other modules provided through Spack (see Spack and E4S for details).
For example, for cmake:
uan-0001:~$ module load spack
uan-0001:~$ module load cmake
For iprof (the module name is thapi):
uan-0001:~$ module load spack thapi
OpenMP Stack Size on the CPU
The default stack size per CPU OpenMP thread with Intel OpenMP is 4 MB ( https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2023-0/supported-environment-variables.html ). It can be queried at runtime by running with OMP_DISPLAY_ENV=T set. If you see a segfault in a code that uses OpenMP CPU threads, you can try increasing the stack size via the OMP_STACKSIZE environment variable.
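For example (the 16M value is purely illustrative, not a recommendation):
export OMP_DISPLAY_ENV=T   # print the OpenMP environment at program startup
export OMP_STACKSIZE=16M   # increase the per-thread stack size for CPU OpenMP threads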
GPU Validation Check
In some cases a workload might hang on the GPU. In such situations, you can use the included gpu_check script (FLR in JLSE), which is set up when you load the runtime, to verify whether all the GPUs are okay, kill any hung/running workloads on the GPU, and, if necessary, reset the GPUs.
x1922c6s6b0n0:~$ gpu_check -rq
Checking 6 GPUs . . . . . . .
All 6 GPUs are okay!!!
MPI
Various ways to use MPI.
Aurora MPICH
Aurora MPICH will be the primary MPI implementation on Aurora. It is jointly developed by Intel and Argonne and supports GPU-aware communication.
You should have access to it with the default oneAPI module loaded.
Use the associated compiler wrappers mpicxx, mpifort, mpicc, etc., as opposed to the Cray wrappers CC, ftn, cc. As always, the MPI compiler wrappers automatically link in the MPI libraries when you use them to link your application.
Use mpiexec to invoke your binary, or a wrapper script around your binary. You will generally need to use a wrapper script to control how MPI ranks are placed within and among GPUs. Variables set by the HPE PMIx system provide hooks to things like node counts and rank counts.
The following job script and wrapper script illustrate:
Example job script: jobscript.pbs
#!/bin/bash
#PBS -l select=32:system=sunspot,place=scatter
#PBS -A MyProjectAllocationName
#PBS -l walltime=01:00:00
#PBS -N 32NodeRunExample
#PBS -k doe

export TZ='/usr/share/zoneinfo/US/Central'
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=8
unset OMP_PLACES

cd /path/to/my/run/directory

echo Jobid: $PBS_JOBID
echo Running on host `hostname`
echo Running on nodes `cat $PBS_NODEFILE`

NNODES=`wc -l < $PBS_NODEFILE`
NRANKS=12                  # Number of MPI ranks per node
NDEPTH=16                  # Number of hardware threads per rank, spacing between MPI ranks on a node
NTHREADS=$OMP_NUM_THREADS  # Number of OMP threads per rank, given to OMP_NUM_THREADS

NTOTRANKS=$(( NNODES * NRANKS ))

echo "NUM_NODES=${NNODES} TOTAL_RANKS=${NTOTRANKS} RANKS_PER_NODE=${NRANKS} THREADS_PER_RANK=${OMP_NUM_THREADS}"
echo "OMP_PROC_BIND=$OMP_PROC_BIND OMP_PLACES=$OMP_PLACES"

mpiexec -np ${NTOTRANKS} -ppn ${NRANKS} -d ${NDEPTH} --cpu-bind depth -envall gpu_tile_compact.sh ./myBinaryName
Here gpu_tile_compact.sh should be in your path; it is located at /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh and will round-robin GPU tiles between ranks.
The example job script includes everything needed except the queue name, which will default appropriately. Submit it using qsub:
qsub jobscript.pbs
CrayMPI (WIP)
Cray MPI is the MPI provided by HPE and is a derivative of MPICH. It is optimized for Slingshot but provides no integration with the Intel GPUs.
This is set up for CrayPE 22.10.
Check CPE Version
> ls -l /opt/cray/pe/cpe
total 0
drwxr-xr-x 2 root root 264 Jun 1 21:56 22.10
lrwxrwxrwx 1 root root   5 Jun 1 21:41 default -> 22.10
Building on UAN
Configure the modules to bring in support for CPE and expected PALS environment.
UAN Build
#If still using oneapi SDK
> module unload mpich
#Purge env if you want to use Cray PE GNU compilers
#module purge
> module load craype PrgEnv-gnu cray-pmi craype-network-ofi craype-x86-spr craype/2.7.17 cray-pals/1.2.9 cray-libpals/1.2.9 cray-mpich
You can use the Cray HPE wrappers to compile MPI code that is CPU-only.
CPU-only compile/link
> cc -o test test.c
> ldd test | grep mpi
        libmpi_gnu_91.so.12 => /opt/cray/pe/lib64/libmpi_gnu_91.so.12 (0x00007ff2f3329000)
Code that utilizes offload should be built with the Intel compiler suite; otherwise, linking with cc could result in SPIR-V code being stripped from the binary.
Add the specific MPI compiler and linker flags to link within your Makefile and use the Intel compiler of choice.
Makefile
CXX=icpx
CMPIFLAGS=-I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include
CXXOMPFLAGS=-fiopenmp -fopenmp-targets=spir64
CXXSYCLFLAGS=-fsycl -fsycl-targets=spir64
CMPILIBFLAGS=-D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2

TARGETS=mpi-omp mpi-sycl

all: $(TARGETS)

mpi-omp.o: mpi-omp.cpp
	$(CXX) -c $(CXXOMPFLAGS) $(CMPIFLAGS) $^

mpi-sycl.o: mpi-sycl.cpp
	$(CXX) -c $(CXXSYCLFLAGS) $(CMPIFLAGS) $^

mpi-omp: mpi-omp.o
	$(CXX) -o $@ $^ $(CXXOMPFLAGS) $(CMPILIBFLAGS)

mpi-sycl: mpi-sycl.o
	$(CXX) -o $@ $^ $(CXXSYCLFLAGS) $(CMPILIBFLAGS)

clean::
	rm -f *.o $(TARGETS)
Expected output
Build Output
> make
icpx -c -fiopenmp -fopenmp-targets=spir64 -I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include mpi-omp.cpp
icpx -o mpi-omp mpi-omp.o -fiopenmp -fopenmp-targets=spir64 -D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2
icpx -c -fsycl -fsycl-targets=spir64 -I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include mpi-sycl.cpp
icpx -o mpi-sycl mpi-sycl.o -fsycl -fsycl-targets=spir64 -D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2
Running on Compute Nodes
The job script must load the appropriate modules. It must also set the path so that the correct libpals is found, as an older version gets picked up by default regardless of module selection.
run.sh
#!/bin/bash
#PBS -A Aurora_deployment
#PBS -q workq
#PBS -l select=1
#PBS -l walltime=10:00
#PBS -l filesystems=home

rpn=6
ranks=$((PBS_NODES * rpn))

#If still using oneapi SDK
module unload mpich
#Purge env if you want to use Cray PE GNU compilers
#module purge
module load craype PrgEnv-gnu cray-pmi cray-pmi-lib craype-network-ofi craype-x86-spr craype/2.7.17 cray-pals/1.2.4 cray-libpals/1.2.4 cray-mpich
module list

cd $PBS_O_WORKDIR
mpiexec -n $ranks -ppn $rpn ./mpi-omp
Submit the job from the UAN
Job submission
> qsub ./run.sh
1123.amn-0001
Output from the test cases
OMP Output
> mpiexec -n 6 -ppn 6 ./mpi-omp
hi from device 2 and rank 2
hi from device 0 and rank 0
hi from device 3 and rank 3
hi from device 4 and rank 4
hi from device 1 and rank 1
hi from device 5 and rank 5
SYCL Output
> mpiexec -n 6 -ppn 6 ./mpi-sycl
World size: 6
Running on Intel(R) Graphics [0x0bd6]
Hello, World from 4 !
Running on Intel(R) Graphics [0x0bd6]
Hello, World from 3 !
Running on Intel(R) Graphics [0x0bd6]
Hello, World from 0 !
Running on Intel(R) Graphics [0x0bd6]
Hello, World from 1 !
Running on Intel(R) Graphics [0x0bd6]
Hello, World from 2 !
Running on Intel(R) Graphics [0x0bd6]
Hello, World from 5 !
The programs used to generate these outputs are mpi-omp.cpp and mpi-sycl.cpp.
mpi-omp.cpp
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    #pragma omp target device( world_rank % omp_get_num_devices() )
    {
        printf( "hi from device %d and rank %d\n", omp_get_device_num(), world_rank );
    }

    // Finalize the MPI environment.
    MPI_Finalize();
}
mpi-sycl.cpp
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the rank of the process
    int world_rank;
    int world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Bind this rank to one GPU via the Level Zero affinity mask
    char zemask[256];
    snprintf(zemask, sizeof(zemask), "ZE_AFFINITY_MASK=%d", world_rank % 6);
    putenv(zemask);

    if (world_rank == 0)
        std::cout << "World size: " << world_size << std::endl;

    sycl::queue Q(sycl::gpu_selector{});
    std::cout << "Running on "
              << Q.get_device().get_info<sycl::info::device::name>()
              << "\n";

    Q.submit([&](sycl::handler &cgh) {
        // Create an output stream
        sycl::stream sout(1024, 256, cgh);
        // Submit a unique task, using a lambda
        cgh.single_task([=]() {
            sout << "Hello, World from " << world_rank << " ! " << sycl::endl;
        }); // End of the kernel function
    });     // End of the queue commands. The kernel is now submitted
    Q.wait();

    // Finalize the MPI environment.
    MPI_Finalize();
}
Kokkos
There is one central build of Kokkos in place now, with {Serial, OpenMP, SYCL} execution spaces and ahead-of-time (AoT) compilation for PVC.
module use /soft/modulefiles
module load kokkos
will load it. If you're using cmake to build your Kokkos app, it's the usual drill (note that cmake is available via module load spack cmake). Otherwise, loading this module will set the KOKKOS_HOME environment variable, which you can use in Makefiles etc. to find include files and libraries.
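As a sketch of the cmake route, a minimal CMakeLists.txt for a Kokkos application might look like the following (the project and source file names are hypothetical; if CMake does not locate the installation automatically, point it at the module's prefix, e.g. -DKokkos_ROOT=$KOKKOS_HOME):
cmake_minimum_required(VERSION 3.16)
project(my_kokkos_app CXX)

# Pull in the centrally installed Kokkos package
find_package(Kokkos REQUIRED)

add_executable(my_app main.cpp)
# Kokkos::kokkos propagates include paths, backend flags, and libraries
target_link_libraries(my_app Kokkos::kokkos)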
Debugging Applications
Running gdb-oneapi in batch mode
In batch mode, gdb-oneapi can attach to each MPI rank to obtain stack traces. The standard output and error can go to individual files distinguished by the environment variables PBS_JOBID and PALS_RANKID. The example command below uses mpiexec to launch bash in order to access the environment variables of each MPI rank and redirect their outputs. The bash process calls gdb-oneapi, which launches ./the_executable with optional arguments. The gdb commands "run" and "thread apply all bt" run the executable and print a backtrace when the application receives an erroneous signal. More gdb commands can be added, each prefixed by "-ex", for example to set breakpoints or extra signal handlers. Note that the command below follows the Bourne shell's quoting rules: the whole gdb-oneapi ... command is in single quotes, so the environment variables are only interpreted by the bash process launched by mpiexec.
mpiexec [mpiexec_args ...] bash -c ' gdb-oneapi -batch -ex run -ex "thread apply all bt" --args ./the_executable [executable_args ...] >out.${PBS_JOBID%.*}.$PALS_RANKID 2>err.${PBS_JOBID%.*}.$PALS_RANKID'
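Additional -ex options can be chained in the same way. For instance, a variant that stops every rank at a breakpoint and prints backtraces there might look like this (the function name my_solver_function is a hypothetical placeholder):
mpiexec [mpiexec_args ...] bash -c '
  gdb-oneapi -batch \
    -ex "break my_solver_function" \
    -ex run \
    -ex "thread apply all bt" \
    --args ./the_executable [executable_args ...] \
    >out.${PBS_JOBID%.*}.$PALS_RANKID 2>err.${PBS_JOBID%.*}.$PALS_RANKID'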
Conda
source $IDPROOT/etc/profile.d/conda.sh
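After sourcing the script above (this assumes $IDPROOT is set by the oneAPI environment), the usual conda commands are available. For example:
conda env list        # list the available environments
conda activate base   # activate the default environment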
Spack and E4S
Spack is a package manager used to manage HPC software environments.
The Extreme-Scale Scientific Software Stack (E4S) is a project of ECP which provides an open-source scientific software stack.
The ALCF provides Spack-managed software on Sunspot via modules, including E4S deployments.
Using Spack packages
Currently, three Spack metamodules are available: spack/linux-sles15-x86_64-ldpath, e4s/22.08, and e4s/22.11. Loading a metamodule makes additional software modules available:
uan-0001:~$ module load spack
uan-0001:~$ module avail

--------------------- /soft/packaging/spack/gnu-ldpath/modules/linux-sles15-x86_64 ----------------------
autoconf/2.69-gcc-11.2.0-mfogo75       ninja/1.11.1-gcc-11.2.0-6biwuw5
autoconf/2.71-gcc-11.2.0-ofpl6wv (D)   numactl/2.0.14-gcc-11.2.0-nzqw57c
automake/1.15.1-gcc-11.2.0-2kuz3tx     openssl/1.1.1d-gcc-11.2.0-amlvxob
babeltrace2/2.0.4-gcc-11.2.0-xfjn3pn   patchelf/0.17.0-gcc-11.2.0-rsf5nuy
bzip2/1.0.6-gcc-11.2.0-gs35ttl         perl/5.26.1-gcc-11.2.0-pqmes6b
cmake/3.24.2-gcc-11.2.0-pcasswq        pkg-config/0.29.2-gcc-11.2.0-cchn55a
...

uan-0001:~$ module load cmake
uan-0001:~$ which cmake
/soft/packaging/spack/gnu-ldpath/build/linux-sles15-x86_64/gcc-11.2.0/cmake-3.24.2-pcasswqhzb3tyew7ujqyxxvvwdsvnyqd/bin/cmake
The spack module loads basic libraries and utilities, while the e4s modules load more specialized scientific software. Packages in the e4s modules are not optimized when installed; however, individual package builds may be customized by request to provide alternative variants or improve performance.
The available packages for each Spack deployment are listed at the bottom of this page.
Using Spack to build packages
You may find Spack useful for your own software builds, particularly if there is a large dependency tree associated with your software. In order to do so, you will need to install a user instance of Spack. We recommend using the latest develop branch of Spack since it includes some necessary patches for Sunspot's environment. See Spack's Getting Started Guide for installation details.
You can also copy the Spack configuration files used for the E4S deployment; this may simplify the process of using the oneAPI compilers as well as any external libraries and dependencies. Copy the files in /soft/packaging/spack/settings/ into your Spack installation at $spack/etc/spack to apply the configurations to all environments using your Spack instance.
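A minimal sketch of that workflow, assuming you clone Spack into $HOME/spack (the package installed at the end is just an example):
git clone https://github.com/spack/spack.git $HOME/spack
source $HOME/spack/share/spack/setup-env.sh
# copy the E4S deployment settings into your instance
cp -r /soft/packaging/spack/settings/* $HOME/spack/etc/spack/
# then build a package with your instance
spack install cmake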
Package lists
- Spack metamodule packages: /soft/packaging/spack/gnu-ldpath/modules/linux-sles15-x86_64
- E4S 22.11 packages: /soft/packaging/spack/e4s/22.11/modules
VTune
Please refer to the JLSE testbed VTune documentation (note that this page requires a JLSE Aurora early hw/sw resource account for access).
Because logging in to the Sunspot login nodes is a two-step process that goes through the bastion nodes first, the instructions for running VTune Profiler as a web server must be augmented. On the desktop/laptop where you initiate the ssh session for port forwarding to the VTune GUI backend you have started on Sunspot, make the following addition to your ~/.ssh/config file:
Insert this into your ~/.ssh/config file
Host *.sunspot.alcf.anl.gov sunspot.alcf.anl.gov
    ProxyJump bastion.alcf.anl.gov
    DynamicForward 3142
    IdentityFile ~/.ssh/id_rsa
Replace id_rsa with the name of your own private ssh key file. When you run the port-forwarding ssh command on your laptop/desktop, you'll be prompted for two ALCF authentication one-time passwords - one for the bastion and the other for the Sunspot login node.
DAOS
Users should submit a request as noted below to have their DAOS pool created. Once created, users may create and manage containers within the pool as they wish. At this time, we ask users to avoid creating data using erasure-encoding data protection; the current release of DAOS has an issue during rebuild of EC-protected data, which will be resolved in the next DAOS release.
Note: When DAOS is upgraded to 2.4, the system will be reformatted which will lead to data loss. Any critical data should be backed up to $HOME. Notification will be provided before the update happens.
Using DAOS:
Your pool will be named by the short name of your project. You will have permissions to create and manage containers within the pool.
- Request a storage allocation of 1 to 50 TB for your project by emailing support@alcf.anl.gov with the following information:
- Sunspot DAOS Pool
- Username for owner
- Unix group for read/write access
- Storage capacity
- Load the daos/base module. (This should be a default module)
module load daos/base
module list
- Confirm access to pool
Pool Example
daos pool query <pool name>

harms@uan-0002:~> daos pool query software
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=640, disabled=0, leader=2, version=131
Pool space info:
- Target(VOS) count:640
- Storage tier 0 (SCM):
  Total size: 6.0 TB
  Free: 4.4 TB, min:6.5 GB, max:7.0 GB, mean:6.9 GB
- Storage tier 1 (NVMe):
  Total size: 200 TB
  Free: 194 TB, min:244 GB, max:308 GB, mean:303 GB
Rebuild done, 4 objs, 0 recs
- Create a container
The container is your basic unit of storage. A POSIX container can contain hundreds of millions of files, and you can use it to store all of your data. You only need a small set of containers, perhaps just one per major unit of project work.
Container Example
mkcont --type POSIX --pool <pool name> --user $USER --group <group> <container name>

harms@uan-0002:~> mkcont --type=POSIX --pool iotest --user harms --group users random
Container UUID : 9a6989d3-3835-4521-b9c6-ba1b10f3ec9c
Container Label: random
Container Type : POSIX

Successfully created container 9a6989d3-3835-4521-b9c6-ba1b10f3ec9c
0
- Mount the container
Currently, you must manually mount your container prior to use on any node you are working on.
For the UAN, mount it at a convenient mount point using the default dfuse parameters. This enables full caching of both metadata and data for the best interactive performance.
Mount Example
dfuse --pool=<pool name> --cont=<cont name> -m $HOME/daos/<pool>/<cont>

mkdir -p $HOME/daos/iotest/random
dfuse --pool=iotest --cont=random -m $HOME/daos/iotest/random

harms@uan-0002:~> mount | grep iotest
dfuse on /home/harms/daos/iotest/random type fuse.daos (rw,nosuid,nodev,noatime,user_id=4211,group_id=100,default_permissions)
On the compute nodes (CNs), you need to mount the container on every node in the job. We provide some scripts to help perform this from within your job script.
More examples are available in /soft/daos/examples. The following example uses two support scripts to start dfuse on each compute node and then shut it down at the end of the job.
Job Submission
qsub -v DAOS_POOL=<name>,DAOS_CONT=<name> ./job-script.sh
Job Script Example
#!/bin/bash
#PBS -A <project>
#PBS -lselect=1
#PBS -lwalltime=30:00
#PBS -k doe
#
# Test case for MPI-IO code example
#

# ranks per node
rpn=4

# threads per rank
threads=1

# nodes per job
nnodes=$(cat $PBS_NODEFILE | wc -l)

# Verify the pool and container are set
if [ -z "$DAOS_POOL" ]; then
    echo "You must set DAOS_POOL"
    exit 1
fi
if [ -z "$DAOS_CONT" ]; then
    echo "You must set DAOS_CONT"
    exit 1
fi

# load daos/base module (if not loaded)
module load daos/base
module unload mpich/50.1/icc-all-pmix-gpu
module use /soft/restricted/CNDA/updates/modulefiles
module load mpich/50.2-daos/icc-all-pmix-gpu

# print your module list (useful for debugging)
module list

# print your environment (useful for debugging)
#env

# turn on output of what is executed
set -x

#
# clean previous mounts (just in case)
#
clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT}

# launch dfuse on all compute nodes
# will be launched using pdsh
# arguments:
#   pool:container
#   may list multiple pool:container arguments
# will be mounted at:
#   /tmp/<pool>/<container>
launch-dfuse.sh ${DAOS_POOL}:${DAOS_CONT}

# change to submission directory
cd $PBS_O_WORKDIR

# run your job(s)
# these test cases assume 'testfile' is in the CWD
cd /tmp/${DAOS_POOL}/${DAOS_CONT}

# --no-vni enables DAOS access
echo "write"
mpiexec -np $((rpn*nnodes)) \
        -ppn $rpn \
        -d $threads \
        --cpu-bind numa \
        --no-vni \
        -genvall \
        /soft/daos/examples/src/posix-write

echo "read"
mpiexec -np $((rpn*nnodes)) \
        -ppn $rpn \
        -d $threads \
        --cpu-bind numa \
        --no-vni \
        -genvall \
        /soft/daos/examples/src/posix-read

# cleanup dfuse mounts
clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT}

exit 0
- Usage
Applications can use POSIX calls as normal with a DAOS POSIX container. MPI-IO based codes can use the DAOS MPICH ADIO driver by prepending the 'daos:' string to the path passed to MPI_File_open().
Additionally, improved performance can be obtained by using a special preloaded library. Adding LD_PRELOAD=$DAOS_PRELOAD to the mpiexec command will enable kernel bypass for most POSIX I/O calls, while metadata operations still go through the FUSE mount point.
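As a sketch of the 'daos:' prefix in an MPI-IO code (the pool, container, and file names are placeholders, the path assumes the container is dfuse-mounted at /tmp/<pool>/<cont> as in the job script above, and error checking is omitted):
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* "daos:" selects the DAOS MPICH ADIO driver for this file */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "daos:/tmp/mypool/mycont/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* each rank writes its rank id at its own offset */
    MPI_File_write_at(fh, (MPI_Offset)(rank * sizeof(int)), &rank, 1, MPI_INT,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}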