Running Jobs On ThetaGPU


Theta GPU Nodes

Note: Users need an allocation on ThetaGPU to use the GPU nodes. Request an allocation by filling out the Allocation request form. ThetaGPU is listed under Theta on the form.

How do I run on multiple GPU nodes?

Until Cobalt and mpirun are more tightly integrated on the GPU nodes, you must identify the nodes Cobalt assigned to your job and pass them to mpirun, along with a few other mpirun options. The following two snippets show how to get the hosts allocated to the job and pass them to mpirun.

Option 1 - simplest

mpirun -hostfile $COBALT_NODEFILE -n 16 -npernode 8 mpi-example-code

where $COBALT_NODEFILE is a file that the -hostfile option can use.

Option 2 - a little more complicated

HOSTS=$(cat $COBALT_NODEFILE | sed ':a;N;$!ba;s/\n/,/g')

mpirun --np 16 --host $HOSTS --oversubscribe ./mpi-example-code

To specifically see how the MPI ranks were assigned, one could add --display-map --display-allocation to the mpirun options.
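The two options above can be wrapped in a small helper script. The following is a minimal sketch, assuming $COBALT_NODEFILE lists one hostname per line and that you want 8 ranks per node (one per GPU). The function names hosts_csv and total_ranks are hypothetical and are not part of Cobalt or mpirun.

```shell
# Join the hostnames in a node file into a comma-separated list
# (same result as the sed expression in option 2).
hosts_csv() {
    paste -sd, "$1"
}

# Total ranks = number of non-empty lines in the node file * ranks per node.
total_ranks() {
    local nodefile="$1" per_node="$2"
    local nodes
    nodes=$(grep -c . "$nodefile")
    echo $(( nodes * per_node ))
}

# Only launch when running inside a Cobalt job.
if [ -n "${COBALT_NODEFILE:-}" ]; then
    NP=$(total_ranks "$COBALT_NODEFILE" 8)   # assumption: 8 ranks per node
    mpirun -hostfile "$COBALT_NODEFILE" -n "$NP" -npernode 8 \
        --display-map --display-allocation ./mpi-example-code
fi
```

Computing the rank count from the node file avoids hard-coding -n 16, so the same script works when the job is resubmitted with a different node count.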

How do I control which Cobalt instance (KNL or GPU) my commands will be sent to?


Because of the difference in architectures and limitations in Cobalt V1, we are running two Cobalt instances: the existing one for the KNL nodes, which remains as is, and a second one for the GPU nodes. You need to be able to control which instance you are interacting with, and there are several ways to do so.

  • As was true in the past, if you do nothing, commands default to the architecture associated with the host you are on when you issue them:
    • If you are on the Theta login nodes, commands default to the KNL instance.
    • If you are on a GPU node, for instance one of the build nodes, commands default to the GPU instance.
  • You can set an environment variable to control which instance the default commands (qsub, qstat, etc.) interact with. The primary use case is users who only use GPU nodes but work from the Theta login nodes. To do so, you may:
    • `module load cobalt/cobalt-knl`, which makes Cobalt commands interact with the original Cobalt instance and launch jobs on the KNL nodes
    • `module load cobalt/cobalt-gpu`, which makes Cobalt commands interact with the new Cobalt instance and launch jobs on the GPU nodes
    • set COBALT_CONFIG_FILES=<path to cobalt config> directly:
      • knl config: /etc/cobalt.knl.conf
      • gpu config: /etc/cobalt.gpu.conf
  • You can use suffixed commands to explicitly control which instance you are interacting with. If you regularly use both types of nodes, this is the recommended path to avoid confusion and to prevent launching jobs on the wrong architecture.
    • All the commands you are used to are there and take the same command line parameters; they just have either a -knl or a -gpu suffix. For instance:
      • qsub-knl <parameters> would submit a job to the KNL nodes
      • qstat-gpu would check the queue status for the GPU nodes
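As an illustration of the environment-variable approach, the sketch below wraps the two config paths listed above in a shell function. The function name use_cobalt is hypothetical; only the COBALT_CONFIG_FILES variable and the two config file paths come from this page.

```shell
# Hypothetical helper: point the default Cobalt commands (qsub, qstat, ...)
# at one Cobalt instance by setting COBALT_CONFIG_FILES.
use_cobalt() {
    case "$1" in
        knl) export COBALT_CONFIG_FILES=/etc/cobalt.knl.conf ;;
        gpu) export COBALT_CONFIG_FILES=/etc/cobalt.gpu.conf ;;
        *)   echo "usage: use_cobalt knl|gpu" >&2; return 1 ;;
    esac
}
```

After running `use_cobalt gpu` on a Theta login node, a plain `qstat` in the same shell would be expected to query the GPU instance. The suffixed commands remain the safer choice if you switch between both node types often.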

How do I control whether I am requesting full DGX nodes or individual GPUs?

The DGX nodes, each of which contains 8 A100 GPUs, are extremely powerful, and it can be very difficult for a single job to use an entire node efficiently. For this reason, you may request either full nodes (all 8 GPUs) or individual GPUs. What you are assigned (a node or a GPU) depends on the queue you submit to:

  • If the queue name ends in -node, you will get full nodes (8 A100 GPUs) 
  • If the queue name ends in -gpu, you will get an individual GPU
  • The -n parameter to qsub is the number of resources of the type in that queue. So, for example:
    • `qsub -n 2 -q full-node <rest of command line>` would get you two full DGX nodes, for a total of 16 A100 GPUs
    • `qsub -n 2 -q single-gpu <rest of command line>` would get you two A100 GPUs
  • For reservations, you can only have one queue, and the resources in the queue need to be consistent, so your entire reservation must be in nodes or GPUs.  If you need both, you will need two reservations, one for each type of resource.
  • Node names are of the form thetagpu##, where ## ranges from 01 to 24. Each name refers to an entire node (8 GPUs).
  • GPU names are of the form thetagpu##-gpu# where the GPU numbers range from 0-7.
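The relationship between the queue suffix, the -n value, and the number of GPUs you receive can be sketched as a small function. This is only an illustration of the arithmetic described above; the function name gpus_for_request is hypothetical and is not a Cobalt command.

```shell
# Hypothetical helper: how many A100 GPUs does "qsub -n COUNT -q QUEUE" request?
# Queues ending in -node hand out full DGX nodes (8 GPUs each);
# queues ending in -gpu hand out individual GPUs.
gpus_for_request() {
    local queue="$1" count="$2"
    case "$queue" in
        *-node) echo $(( count * 8 )) ;;
        *-gpu)  echo "$count" ;;
        *)      echo "unknown queue suffix: $queue" >&2; return 1 ;;
    esac
}

gpus_for_request full-node 2    # two full nodes
gpus_for_request single-gpu 2   # two individual GPUs
```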

What is Multi-Instance GPU (MIG) mode?

The A100 GPUs have a capability known as Multi-Instance GPU (MIG). This allows a single A100 GPU to be reconfigured at the hardware level into as many as 7 instances. The valid configurations are shown in a table on the MIG page referenced above. Each instance appears as a GPU to the application. To use this feature, the GPU must be put into MIG mode, which requires a reset of the GPU. At present, we do not support scheduling at the MIG level. However, you can request that your GPU be put in MIG mode and then reconfigure it into a supported configuration from your job script.

If you wish to have the resources you have requested put into MIG mode, add the following to your qsub command line:

  • --attrs mig-mode=True 
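For context, a complete submission with this attribute might look like the sketch below. The queue name single-gpu is taken from the examples on this page; the 30-minute walltime and the script name ./mig_job.sh are placeholders chosen for illustration. The command is only assembled and printed here, not submitted.

```shell
# Assemble (but do not run) a qsub command that requests one GPU in MIG mode.
QUEUE=single-gpu          # queue name from the examples above
WALLTIME=30               # assumption: 30 minutes, placeholder value
JOB_SCRIPT=./mig_job.sh   # hypothetical job script name
CMD="qsub -n 1 -q $QUEUE -t $WALLTIME --attrs mig-mode=True $JOB_SCRIPT"
echo "$CMD"
```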

Where can I find the details of a job submission?

Details of the job submission are recorded in the <jobid>.cobaltlog file, which contains the qsub command and environment variables. The location of this file can be controlled with `qsub --debuglog <path>`; by default it is written to the same place as the .output and .error files.

Why is my job stuck in "starting" state?

If you submit a job and qstat shows it in the "starting" state for 5 minutes or more, your memory/NUMA mode selection most likely requires rebooting some or all of the nodes your job was assigned. This process takes about 15 minutes, during which your job appears to be in the "starting" phase. When no reboots are required, the "starting" phase lasts only a matter of seconds.

What are the "utime" and "stime" values printed at the bottom of the <jobid>.output file?

At the bottom of a <jobid>.output file, there is usually a line like:

Application 3373484 resources: utime ~6s, stime ~6s, Rss ~5036, inblocks ~0, outblocks ~8


The "utime" and "stime" values are the user CPU time and system CPU time as reported by aprun and getrusage. They are rounded aggregate numbers scaled by the number of resources used, and are approximate. The aprun man page has more information about them.

Do #COBALT directives need to start on the second line of a job script?

Yes. If #COBALT directives are used inside a job submission script, they must appear in the topmost lines of the script. #COBALT directives following a blank line are ignored. Attempting to qsub the following example script, which has a blank line before the #COBALT directive, leads to the error message shown after it.

> cat submit.csh
#!/bin/csh

#COBALT -n 128 -t 2:00:00 -q default
aprun -n 8192 -N 64 -d 1 -j 1 --cc depth ./my_app
> qsub submit.csh
Usage: [options] []
Refer to man pages for JOBID EXPANSION and SCRIPT JOB DIRECTIVES.
No required options provided

A correct submission script would look like the following with the blank line removed.

> cat submit.csh
#!/bin/csh
#COBALT -n 128 -t 2:00:00 -q default
aprun -n 8192 -N 64 -d 1 -j 1 --cc depth ./my_app
> qsub submit.csh
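To catch this mistake before submitting, a script can be scanned for #COBALT lines that appear after a blank line. The following is a minimal sketch; the function name check_cobalt_directives is hypothetical, and it checks only the blank-line rule described above, not any other parsing rule Cobalt may apply.

```shell
# Hypothetical pre-submission check: report any #COBALT directive that
# appears after a blank line. Returns non-zero if one is found.
check_cobalt_directives() {
    awk '
        BEGIN { bad = 0 }
        /^[[:space:]]*$/ { seen_blank = 1; next }   # remember the first blank line
        /^#COBALT/ && seen_blank {
            printf "line %d: #COBALT directive after a blank line will be ignored\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

A possible usage is `check_cobalt_directives submit.csh && qsub submit.csh`, which submits only when no misplaced directive is reported.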