GPU Monitoring

Help Desk

Theta
Updated: 12/01/2020

Each GPU on ThetaGPU hosts 8 A100 GPUs. You can see information about these GPUs via the command nvidia-smi.

Each GPU has 40Gb of on-GPU memory. When you run applications, you will know the GPU is in use when you see the memory increase and the GPU Utilization will be non-zero.

You can target a specific GPU with nvidia-smi -i 0 for the first GPU, for example.

GPU Selection

In many application codes, you may want to specifiy which GPU is used. This is particular important in node-sharing applications where each GPU is running it's own code, which can be either in data-parallel model training, workflow based throughput jobs, etc. You can control individual process launches with:

# Specify to run only on GPU 4: 
export CUDA_VISIBLE_DEVICES=4 

# Let your application see GPUS 0, 1, and 7: 
export CUDA_VISIBLE_DEVICES="0,1,7"

In these cases, the GPU orderings will appear as a consecutive list starting with 0.

From inside an application, many software frameworks have ability to let you target specific GPUs, including tensorflow and pytorch:

Tensorflow

Pytorch

Systems