Argonne Leadership Computing Facility

The base conda environment on Polaris comes with Microsoft's DeepSpeed pre-installed. Instructions for using or cloning the base environment can be found here.

A batch submission script for the following example is available here.

We describe below the steps needed to get started with DeepSpeed on Polaris.

We focus on the cifar example provided in the DeepSpeedExamples repository, though this approach should be generally applicable for running any model with DeepSpeed support.


The instructions below should be run directly from a compute node.

Explicitly, to request an interactive job (from polaris-login):

$ qsub -I -A <project> -q debug-scaling -l select=2 -l walltime=01:00:00

Refer to job scheduling and execution for additional information.

Running DeepSpeed on Polaris

  1. Clone microsoft/DeepSpeedExamples and navigate into the directory:

    $ git clone https://github.com/microsoft/DeepSpeedExamples.git
    $ cd DeepSpeedExamples/cifar

  2. Load conda module and activate base environment:

    $ module load conda
    $ conda activate base
    $ which python3

  3. Create a DeepSpeed-compliant hostfile that lists the hostname and number of GPUs (slots) for each available worker:

    $ cat $PBS_NODEFILE > hostfile
    $ sed -e 's/$/ slots=4/' -i hostfile

  4. Create a .deepspeed_env containing the environment variables our workers will need access to:

    $ echo "PATH=${PATH}" >> .deepspeed_env
    $ echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> .deepspeed_env
    $ echo "http_proxy=${http_proxy}" >> .deepspeed_env
    $ echo "https_proxy=${https_proxy}" >> .deepspeed_env


    The .deepspeed_env file expects each line to be of the form KEY=VALUE. Each of these will then be set as environment variables on each available worker specified in our hostfile.

  5. We can then launch the training script with DeepSpeed:

    $ deepspeed --hostfile=hostfile cifar10_deepspeed.py \
        --deepspeed \
        --deepspeed_config ds_config.json
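
Steps 3 and 4 can also be wrapped in a small script so that re-running them does not append duplicate lines to an existing .deepspeed_env. The following is a minimal sketch, not part of DeepSpeed or PBS: the helper names make_hostfile and write_env_file are hypothetical, the value of 4 slots per node is taken from step 3, and skipping unset variables is a choice made here rather than a DeepSpeed requirement.

```shell
#!/bin/sh
# Hypothetical helpers (not provided by DeepSpeed or PBS) that rebuild
# the hostfile and .deepspeed_env from scratch on every run.

make_hostfile() {
    # $1 = node file, $2 = slots (GPUs) per node, $3 = output path
    # Emits "<hostname> slots=<n>" once per unique, non-blank line.
    awk -v slots="$2" 'NF && !seen[$1]++ { print $1 " slots=" slots }' "$1" > "$3"
}

write_env_file() {
    # $1 = output path, remaining arguments = names of variables to export
    env_out="$1"
    shift
    : > "${env_out}"                        # truncate instead of appending
    for env_name in "$@"; do
        eval "env_value=\${${env_name}-}"   # look up the variable by name
        if [ -n "${env_value}" ]; then      # assumption: skip unset or empty variables
            printf '%s=%s\n' "${env_name}" "${env_value}" >> "${env_out}"
        fi
    done
}

# Only act when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    make_hostfile "${PBS_NODEFILE}" 4 hostfile
    write_env_file .deepspeed_env PATH LD_LIBRARY_PATH http_proxy https_proxy
fi
```

Each line written to .deepspeed_env keeps the KEY=VALUE form described in step 4.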


Depending on the details of your specific job, it may be necessary to modify the provided ds_config.json.

If you encounter an error such as:

    x3202c0s31b0n0: AssertionError: Micro batch size per gpu: 0 has to be greater than 0

change "train_batch_size" in the provided ds_config.json from 16 to the total number of available GPUs, and explicitly set "gradient_accumulation_steps": 1, as shown below.

    $ export NRANKS=$(wc -l < "${PBS_NODEFILE}")
    $ export NGPU_PER_RANK=$(nvidia-smi -L | wc -l)
    $ export NGPUS="$((${NRANKS}*${NGPU_PER_RANK}))"
    $ echo "${NRANKS} ${NGPU_PER_RANK} ${NGPUS}"
    24 4 96
    $ cat ds_config.json  # note: 16 --> 96 in "train_batch_size"
        "train_batch_size": 96,
        "gradient_accumulation_steps": 1,
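
The assertion above follows from the relation DeepSpeed uses between its batch settings: train_batch_size = micro batch size per GPU × gradient_accumulation_steps × number of GPUs. When only the first two are given, the micro batch size is obtained by integer division, so 16 samples spread over 96 GPUs rounds down to 0. The sketch below reproduces that arithmetic and shows one possible way to patch the config; the helper names micro_batch_per_gpu and set_train_batch_size are hypothetical, and the sed expression assumes "train_batch_size" appears once in ds_config.json with an integer value.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the batch-size arithmetic.

micro_batch_per_gpu() {
    # $1 = train_batch_size, $2 = gradient_accumulation_steps, $3 = number of GPUs
    echo $(( $1 / ($2 * $3) ))              # shell arithmetic is integer division
}

set_train_batch_size() {
    # $1 = path to the config file, $2 = new train_batch_size
    # Assumes a single "train_batch_size": <integer> entry in the file.
    sed -E "s/(\"train_batch_size\"[[:space:]]*:[[:space:]]*)[0-9]+/\\1$2/" "$1" > "$1.tmp" \
        && mv "$1.tmp" "$1"
}

micro_batch_per_gpu 16 1 96   # prints 0, the value rejected by the assertion
micro_batch_per_gpu 96 1 96   # prints 1

# Patch the config only when it exists and NGPUS has been computed.
if [ -f ds_config.json ] && [ -n "${NGPUS:-}" ]; then
    set_train_batch_size ds_config.json "${NGPUS}"
fi
```

Any train_batch_size that is a positive multiple of gradient_accumulation_steps × number of GPUs satisfies the same relation; setting it equal to the GPU count is simply the smallest such value.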