Argonne Leadership Computing Facility

Miscellaneous

SDK Version

To find the SDK version, start Python in the SambaFlow virtual environment and run the following commands:

(venv) ALCFUserID@sm-01:~$ python
Python 3.7.6 (default, Feb 18 2020, 21:28:31) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sambaflow
>>> sambaflow.__version__
'1.11.5'
>>> 
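
The same check can be scripted, for example to record the SDK version at the start of a job. The following is a minimal sketch: the helper name `module_version` is hypothetical, and it assumes only that the package exposes `__version__` as in the session above.

```python
import importlib


def module_version(name):
    """Return the module's __version__ string, or None if it is unavailable."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        # The package is not installed in the active environment.
        return None
    return getattr(module, "__version__", None)


if __name__ == "__main__":
    # Prints a version string inside the SambaFlow venv, or None elsewhere.
    print("sambaflow version:", module_version("sambaflow"))
```

Returning `None` instead of raising lets the same script run on a machine where the SDK is not installed.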

OMP_NUM_THREADS

The OMP_NUM_THREADS environment variable sets the number of threads to use for parallel regions.

The value of this environment variable must be a comma-separated list of positive integers. Each entry sets the number of threads to use for parallel regions at the corresponding nesting level.

For the SambaNova system, it is usually set to one.

export OMP_NUM_THREADS=1
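
The variable can also be set from inside a Python script, provided this happens before any OpenMP-backed library is imported, since the OpenMP runtime typically reads it at initialization. The sketch below also illustrates how a list value maps to nesting levels. `parse_omp_num_threads` is a hypothetical helper written for illustration and is not part of any SambaNova or OpenMP API.

```python
import os


def parse_omp_num_threads(value):
    """Split an OMP_NUM_THREADS value into one thread count per nesting level."""
    counts = [int(item) for item in value.split(",")]
    if any(count <= 0 for count in counts):
        raise ValueError("OMP_NUM_THREADS entries must be positive integers")
    return counts


# Set the variable before importing libraries that use OpenMP.
os.environ["OMP_NUM_THREADS"] = "1"

# A single entry applies to the outermost parallel level.
print(parse_omp_num_threads(os.environ["OMP_NUM_THREADS"]))
# Two entries: 4 threads at the outer level, 2 at the first nested level.
print(parse_omp_num_threads("4,2"))
```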

Where is the Model?

Two copies of the model are maintained: one in CPU memory and one in RDU memory. They do not interfere with each other unless you explicitly sync the model or its parameters between them using:

SambaTensor.rdu() # Moves the CPU model to the RDU
SambaTensor.cpu() # Moves the RDU model to the CPU

To run the model on the CPU, use the PyTorch model as if there were no RDU. To run the model on the RDU, use session.run().

Useful Commands

SN Configuration

snconfig

The snconfig utility shows the static configuration of the system. The configuration on sm-01 for the first RDU is as follows:

Platform Name: DataScale SN10-8
Node Name: NODE
Number of XRDUS: 4
XRDU Name: XRDU_0
Number of RDUS: 2
RDU name: RDU_0
Number of TILES: 4
TILE Name: TILE_0
Serial Number : N/A
...
Number of PCIES: 4
PCIE Name: PCIE_0
Bandwidth : 32 GB/s
Speed : 16 GT/s
Width : 16
Serial Number : N/A
...
Number of DDRCHs: 6
DDR CH Name: DDRCH_0
Number of DIMMS: 2
DIMM Name: DIMM_C0
Size : 64.0 GB
DIMM Name: DIMM_C1
Size : 0.0 GB
Serial Number : N/A

Current utilization can be seen with sntilestat. In this example, only four tiles in one RDU are in use.

TILE %idle %exec %pload %aload %chkpt %quiesce PID USER COMMAND
/XRDU_0/RDU_0/TILE_0 80.4 7.0 10.4 2.2 0.0 0.0 49880 arnoldw python res_ffn_mnist.py run --pef=pef/res_ffn_mnist/res_ffn_mnist.pef --num-epochs 100
/XRDU_0/RDU_0/TILE_1 80.5 6.9 11.3 1.3 0.0 0.0 49880 arnoldw python res_ffn_mnist.py run --pef=pef/res_ffn_mnist/res_ffn_mnist.pef --num-epochs 100
/XRDU_0/RDU_0/TILE_2 82.1 4.7 11.4 1.8 0.0 0.0 49880 arnoldw python res_ffn_mnist.py run --pef=pef/res_ffn_mnist/res_ffn_mnist.pef --num-epochs 100
/XRDU_0/RDU_0/TILE_3 80.1 6.3 11.7 1.9 0.0 0.0 49880 arnoldw python res_ffn_mnist.py run --pef=pef/res_ffn_mnist/res_ffn_mnist.pef --num-epochs 100
/XRDU_0/RDU_1/TILE_0 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_0/RDU_1/TILE_1 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_0/RDU_1/TILE_2 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_0/RDU_1/TILE_3 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_1/RDU_0/TILE_0 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_1/RDU_0/TILE_1 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_1/RDU_0/TILE_2 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_1/RDU_0/TILE_3 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_1/RDU_1/TILE_0 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_1/RDU_1/TILE_1 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_1/RDU_1/TILE_2 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_1/RDU_1/TILE_3 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_2/RDU_0/TILE_0 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_2/RDU_0/TILE_1 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_2/RDU_0/TILE_2 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_2/RDU_0/TILE_3 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_2/RDU_1/TILE_0 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_2/RDU_1/TILE_1 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_2/RDU_1/TILE_2 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_2/RDU_1/TILE_3 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_3/RDU_0/TILE_0 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_3/RDU_0/TILE_1 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_3/RDU_0/TILE_2 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_3/RDU_0/TILE_3 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_3/RDU_1/TILE_0 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_3/RDU_1/TILE_1 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_3/RDU_1/TILE_2 100.0 0.0 0.0 0.0 0.0 0.0
/XRDU_3/RDU_1/TILE_3 100.0 0.0 0.0 0.0 0.0 0.0

SambaNova Daemon Service

The following command checks if the SambaNova daemon service is running.

systemctl status snd

The output should look something like this:

* snd.service - SN Devices Service
   Loaded: loaded (/usr/lib/systemd/system/snd.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2022-02-18 11:45:15 CST; 1 months 25 days ago
 Main PID: 3550 (snd)
    Tasks: 10 (limit: 19660)
   CGroup: /system.slice/snd.service
           `-3550 /opt/sambaflow/bin/snd

Warning: Journal has been rotated since the unit was started. Log output is incomplete or unavailable.
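
A job script can make the same check before launching work. The following is a minimal sketch that relies on `systemctl is-active`, which exits with status 0 only when the unit is active. The function name `service_is_active` and its injectable `run` parameter are illustrative choices, not part of any SambaNova tooling.

```python
import subprocess


def service_is_active(unit="snd", run=subprocess.run):
    """Return True if systemd reports the unit as active."""
    try:
        # --quiet suppresses the textual state; only the exit status is used.
        result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    except FileNotFoundError:
        # systemctl is not available on this machine.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("snd running:", service_is_active())
```

Passing the runner as a parameter keeps the function easy to exercise on machines without systemd.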

Tile status

sntilestat
watch sntilestat

The output below shows a completely idle system.

TILE                 %idle %exec %pload %aload %chkpt %quiesce    PID     USER COMMAND
/XRDU_0/RDU_0/TILE_0 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_0/RDU_0/TILE_1 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_0/RDU_0/TILE_2 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_0/RDU_0/TILE_3 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_0/RDU_1/TILE_0 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_0/RDU_1/TILE_1 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_0/RDU_1/TILE_2 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_0/RDU_1/TILE_3 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_1/RDU_0/TILE_0 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_1/RDU_0/TILE_1 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_1/RDU_0/TILE_2 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_1/RDU_0/TILE_3 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_1/RDU_1/TILE_0 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_1/RDU_1/TILE_1 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_1/RDU_1/TILE_2 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_1/RDU_1/TILE_3 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_2/RDU_0/TILE_0 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_2/RDU_0/TILE_1 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_2/RDU_0/TILE_2 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_2/RDU_0/TILE_3 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_2/RDU_1/TILE_0 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_2/RDU_1/TILE_1 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_2/RDU_1/TILE_2 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_2/RDU_1/TILE_3 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_3/RDU_0/TILE_0 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_3/RDU_0/TILE_1 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_3/RDU_0/TILE_2 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_3/RDU_0/TILE_3 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_3/RDU_1/TILE_0 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_3/RDU_1/TILE_1 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_3/RDU_1/TILE_2 100.0   0.0    0.0    0.0    0.0      0.0
/XRDU_3/RDU_1/TILE_3 100.0   0.0    0.0    0.0    0.0      0.0
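
To pick out busy tiles without scanning the table by eye, the output can be parsed with a short script. The following is a rough sketch that assumes the whitespace-separated column layout seen in the sntilestat samples on this page; the function names `parse_tilestat` and `busy_tiles` are hypothetical, and the layout may differ between SDK releases.

```python
def parse_tilestat(text):
    """Parse sntilestat-style text into a list of per-tile dictionaries."""
    tiles = []
    for line in text.splitlines():
        fields = line.split()
        # Tile rows start with a path such as /XRDU_0/RDU_0/TILE_0;
        # the header row and blank lines are skipped.
        if len(fields) < 7 or not fields[0].startswith("/XRDU_"):
            continue
        tiles.append({
            "tile": fields[0],
            "idle": float(fields[1]),
            "exec": float(fields[2]),
            # PID and USER are present only when a process holds the tile.
            "pid": int(fields[7]) if len(fields) > 7 else None,
            "user": fields[8] if len(fields) > 8 else None,
        })
    return tiles


def busy_tiles(tiles):
    """Return the tiles that are not 100% idle."""
    return [tile for tile in tiles if tile["idle"] < 100.0]


# Two rows copied from the sample output on this page.
SAMPLE = """\
TILE %idle %exec %pload %aload %chkpt %quiesce PID USER COMMAND
/XRDU_0/RDU_0/TILE_0 80.4 7.0 10.4 2.2 0.0 0.0 49880 arnoldw python res_ffn_mnist.py run --pef=pef/res_ffn_mnist/res_ffn_mnist.pef --num-epochs 100
/XRDU_0/RDU_1/TILE_0 100.0 0.0 0.0 0.0 0.0 0.0
"""

for tile in busy_tiles(parse_tilestat(SAMPLE)):
    print(tile["tile"], tile["user"], tile["pid"])
```

On a live system, the text could come from running sntilestat through `subprocess.run(..., capture_output=True, text=True)` instead of the embedded sample.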

Finding Hung Tiles

snconfig show Node dynamic | grep perfect

How busy is the system?

Use one of the following commands:

top
htop