CrayPat

Help Desk

Theta

Introduction

The Cray Performance Measurement and Analysis Tools (or CrayPat) are a suite of utilities that enable the user to capture and analyze performance data generated during the execution of a program on a Cray system. It includes CrayPat, CrayPat-lite, Cray Apprentices2, Reveal and the Cray PAPI components. 

CrayPat-lite

This section focuses on CrayPat-lite, the simplified, easy-to-use version of the Cray Performance Measurement and Analysis Tool set. CrayPat-lite provides basic performance analysis information automatically, with a minimum of the user interaction. While CrayPat provides more functionalities. 

1. Environment Setup

To use CrayPat-lite, the appropriate module should be loaded, and darshan should be unloaded.

$ module unload darshan

$ module load perftools-lite

Note:

  • Darshan needs to be unloaded since it uses environment variables which CrayPat-lite also uses.
  • Note that the perftools-base module should already be listed under “module list” If not, load perftools base with “module load perftools-base”. This provides access to man pages and help system, and does not instrument code. (Should be loaded by default.)
  • “module load perftools-lite” loads performance instrumentation module, and will instrument programs when they are compiled.

Tip:

  • To verify that the correct modules are loaded, run "module list" and check that perftools-lite and perftools-base are loaded and that darshan is not present.

2. Compile the Code to Use CrayPat-lite

Rebuild code as usual. When the program is built, there should be output like below, noting that CrayPat-lite was used.

$ make clean 
$ make ftn -O3 -qopt-report=5 -g -align array64byte test.f90 -o test 
INFO: creating the CrayPat-instrumented executable 'test' (sample_profile) ...OK

Tip:

  • To verify that the executable was instrumented with CrayPat-lite, use the following command to search the executable for CrayPat strings:
$ strings test | grep "CrayPat/X" 
CrayPat/X: Version 7.0.0 Revision 5c29ce2 12/11/17 15:26:24

3. Run the Code

Run the code as usual

$ qsub -n 8 -t 30 -A project jobscript.sh

$ cat jobscript.sh

#!/bin/bash

aprun -n 512 -N 64 test

4. Output

After the code finishes executing, CrayPat-lite output should be printed to stdout (likely at the end of the jobid.output file). More output will be saved in .rpt files and .ap2 files, which might be under a new directory created in the directory the run occurred in.

An example of output is shown below. Note that it is truncated after Table 1, since this is just for illustration purposes.

$ cat 309226.output

…

Normal program output

…

#################################################################
#                                                               #
#            CrayPat-lite Performance Statistics                #
#                                                               #
#################################################################

 
CrayPat/X:  Version 7.0.0 Revision 5c29ce2  12/11/17 15:26:24
Experiment:                  lite  lite/sample_profile
Number of PEs (MPI ranks):    512
Numbers of PEs per Node:       64  PEs on each of  8  Nodes
Numbers of Threads per PE:      1
Number of Cores per Socket:    64
Execution start time:  Thu Mar 22 18:09:16 2018
System name and speed:  nid00690  1.301 GHz (nominal)
Intel Knights Landing CPU  Family:  6  Model: 87  Stepping:  1
MCDRAM: 7.2 GHz, 16 GiB available as quad, cache (100% cache)

 

Avg Process Time:           74.06 secs
High Memory:             21,298.6 MBytes     41.6 MBytes per PE
Instr per Cycle:             1.16
Observed CPU cycle rate:     1.39 GHz
I/O Read Rate:           1.899058 MBytes/sec
I/O Write Rate:          0.447113 MBytes/sec

Table 1:  Profile by Function
 

  Samp% |    Samp |  Imb. |  Imb. | Group
        |         |  Samp | Samp% |  Function=[MAX10]
        |         |       |       |   PE=HIDE
 100.0% | 7,355.9 |    -- |    -- | Total
|-------------------------------------------------------------
|  67.0% | 4,925.5 |    -- |    -- | USER
||------------------------------------------------------------
||  48.7% | 3,582.1 | 338.9 |  8.7% | genral_
||   9.9% |   726.7 |  85.3 | 10.5% | xyzint_
||   6.5% |   481.6 |  80.4 | 14.3% | rt123_
||============================================================
|  21.5% | 1,580.0 |    -- |    -- | BLAS
||------------------------------------------------------------
||   8.6% |   634.0 |  70.0 | 10.0% | gotoblas_dgetrf_single_knl
||   5.7% |   417.1 |  64.9 | 13.5% | gotoblas_dgemm_kernel_knl
||   2.3% |   168.5 |  34.5 | 17.0% | gotoblas_dlaswp_plus_knl
||   2.0% |   143.8 |  34.2 | 19.2% | gotoblas_dgemv_n_knl
||============================================================
|   8.7% |   637.3 |    -- |    -- | MPI
||------------------------------------------------------------
||   8.3% |   610.0 | 443.0 | 42.2% | MPI_ALLREDUCE
||============================================================
|   2.8% |   206.5 |    -- |    -- | ETC
||------------------------------------------------------------
||   2.6% |   193.8 |  60.2 | 23.7% | __svml_exp8_mask_b3
|=============================================================

By default, the sampling rate for CrayPat-lite is 100 times per second (or once every 104 microseconds) (check the runtime environment variable PAT_RT_SAMPLING_INTERVAL (given in microseconds)to verify the sampling rate).

Samp% is the percent of total samples taken which occurred in the given routine, averaged over all processes.

Samp is the number of samples which occurred in the given routine, averaged over all processes.

Imb. Samp is (maximum number of samples taken in given routine by one process - number of samples taken in given routine, averaged over all processes).

Imb. Samp% is (Imb. Samp) / (maximum number of samples taken in given routine by one process) * (number of processes / (number of processes - 1)) * 100%

Based on the above, 48.7% of total samples occurred in the genral_ routine, averaged over all processes, and the most samples taken in genral_ by one process process differed from the average by 338.9 samples. (If it was perfectly load balanced, Imb. Samp would be 0.) Additionally, 21.1% samples occurred in BLAS routines, averaged over all processes.

We can also see that this was run on 8 KNL nodes with 64 MPI ranks/node, for a total of 512 MPI processes.

More details explaining the output and how to get more information can be found in the References, training assets and the man page.

Manual pages (after module is loaded):

General information about perftools-lite:

$ man perftools-lite

Detailed information about the output report:

$ man pat_report

CrayPat

CrayPat supports two categories of performance analysis experiments: tracing experiments, which count some event such as the number of times a specific system call is executed, and asynchronous (sampling) experiments, which capture values at specified time intervals or when a specified counter overflows.

1. Environmental Setup

To use CrayPat, unload darshan module and load perftools-base and perftools modules. 

jkwack@thetamom2:~/CrayPat_Test> module unload darshan
jkwack@thetamom2:~/CrayPat_Test> module load perftools-base perftools

2. Build your application and instrument it with pat_build

The pat_build command is the instrumenting component of the CrayPat performance analysis tool. After loading the perftools-base and perftools instrumentation modules and recompiling the program, use the pat_build command to instrument the program for data capture. There are several ways to instrument your application as follows:

  • For the default Automatic Profiling Analysis
    $ pat_build a.out
  • For predefined trace groups:
    $ pat_build -g tracegroup a.out
  • For enabling tracing and the CrayPat API:
    $ pat_build -w a.out
  • For instrumenting a single function:
    $ pat_build -T tracefunc a.out
  • For instrumenting a list of functions:
    $ pat_build -t tracefile a.out

The pat_build command is documented in more detail in the pat_build(1) man page. For additional information and examples, see pat_help build.

The following is an example on an interactive job mode on Theta:

jkwack@thetamom2:~/CrayPat_Test> ftn -dynamic -Ofast -fopenmp test.f90 -o test
WARNING: PerfTools is saving object files from a temporary directory into directory '/home/jkwack/.craypat/test/70303'
jkwack@thetamom2:~/CrayPat_Test> pat_build test
jkwack@thetamom2:~/CrayPat_Test> ls -al test*
-rwxr-xr-x 1 jkwack users   46568 Mar  4 19:57 test
-rw-r--r-- 1 jkwack users    6143 Mar  4 19:57 test.f90
-rwxr-xr-x 1 jkwack users 1399496 Mar  4 19:59 test+pat

3. Run your instrumented binary 

Run your instrumented executable binary ending with +pat with your usual runtime parameters.

jkwack@thetamom2:~/CrayPat_Test> aprun -n 64 ./test+pat 4096 100

CrayPat/X:  Version 7.0.5 Revision 82c1110  11/21/18 11:50:49

                      Number of MPI process:     64

                                  Precision: double

       Number of rows/columns of the matrix:   4096

      The highest order of geometric series:    100

                      Number of repetitions:     10

                  Memory Usage per MPI rank:   268.435456 MB

          Warming up ......

          Main Computations  10 repetitions ......

          0%                     25%                      50%                     75%                     100%

          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

                        Error_MPI_{Min,Mean,Max}/MPI =   0.7878E-11    0.7878E-11    0.7878E-11

                       GFLOP-rate_{Min,Mean,Max}/MPI =     0.449981      0.451127      0.451217

                                           Wall Time =    74.568634 sec

                                           FLOP-rate =    28.798753 GFLOP/sec

Experiment data directory written:

/gpfs/mira-home/jkwack/CrayPat_Test/test+pat+9755-3826s

Application 19251348 resources: utime ~5395s, stime ~106s, Rss ~304068, inblocks ~20862, outblocks ~0

4. Post-processing with pat_report

The pat_report command is the text reporting component of the Cray Performance Analysis Tools suite. After using the pat_build command to instrument the program, set the run time environment variables as desired, and then execute the program, use the pat_report command to generate text reports from the resulting data and export the data for use in other applications.

The pat_report command is documented in detail in the pat_report(1) man page. Additional information can be found in the online help system under pat_help report.

The following is for the above example on an interactive mode on Theta:

kwack@thetamom2:~/CrayPat_Test> pat_report -o test.CrayPat_report.txt test+pat+9755-3826s

Processing step 1 of 9

Suggested trace options file:   test+pat+9755-3826s/build-options.apa


Processing step 9 of 9

See test.CrayPat_report.txt file for the detailed performance data:

jkwack@thetamom2:~/CrayPat_Test> cat test.CrayPat_report.txt

CrayPat/X:  Version 7.0.5 Revision 82c1110  11/21/18 11:50:49


Number of PEs (MPI ranks):   64

                           

Numbers of PEs per Node:     64

                           

Numbers of Threads per PE:    1

                           

Number of Cores per Socket:  64


Execution start time:  Wed Mar  4 20:03:22 2020


System name and speed:  nid03826  1.301 GHz (nominal)


Intel Knights Landing CPU  Family:  6  Model: 87  Stepping:  1


DRAM: 192 GiB DDR4-2400 on 1.3 GHz nodes


MCDRAM: 7.2 GHz, 16 GiB available as quad, cache (100% cache)


Current path to data file:

  /home/jkwack/CrayPat_Test/test+pat+9755-3826s   (RTS, 64 data files)



Notes for table 1:


  This table shows functions that have significant exclusive sample

    hits, averaged across ranks.

  For further explanation, see the "General table notes" below,

    or use:  pat_report -v -O samp_profile ...


Table 1:  Profile by Function (limited entries shown)


  Samp% |    Samp | Imb. |  Imb. | Group

        |         | Samp | Samp% |  Function

        |         |      |       |   PE=HIDE

       

100.0% | 8,396.2 |   -- |    -- | Total

|-------------------------------------------

|  99.1% | 8,324.3 | 16.7 |  0.2% | USER

||------------------------------------------

||  99.1% | 8,324.3 | 16.7 |  0.2% | MAIN__

|===========================================


Notes for table 2:


  This table shows functions that have the most significant exclusive

    time, taking the maximum time across ranks and threads.

  For further explanation, see the "General table notes" below,

    or use:  pat_report -v -O profile_max ...


Table 2:  Profile of maximum function times (limited entries shown)


  Samp% |    Samp | Imb. |  Imb. | Function

        |         | Samp | Samp% |  PE=[max,min]

|-----------------------------------------------

| 100.0% | 8,341.0 | 16.7 |  0.2% | MAIN__

||----------------------------------------------

|| 100.0% | 8,341.0 |   -- |    -- | pe.0

||  99.6% | 8,308.0 |   -- |    -- | pe.10

||==============================================

|   1.0% |    87.0 | 15.8 | 18.5% | __powidf2

||----------------------------------------------

||   1.0% |    87.0 |   -- |    -- | pe.10

||   0.7% |    60.0 |   -- |    -- | pe.3

|===============================================


Notes for table 3:


  This table shows functions, and line numbers within functions, that

    have significant exclusive sample hits, averaged across ranks.

  For further explanation, see the "General table notes" below,

    or use:  pat_report -v -O samp_profile+src ...


Table 3:  Profile by Group, Function, and Line (limited entries shown)


  Samp% |    Samp | Imb. |  Imb. | Group

        |         | Samp | Samp% |  Function

        |         |      |       |   Source

        |         |      |       |    Line

        |         |      |       |     PE=HIDE

       

100.0% | 8,396.2 |   -- |    -- | Total

|--------------------------------------------------------------------

|  99.1% | 8,324.3 |   -- |    -- | USER

||-------------------------------------------------------------------

||  99.1% | 8,324.3 |   -- |    -- | MAIN__

3|        |         |      |       |  /home/jkwack/CrayPat_Test/test.f90

||||-----------------------------------------------------------------

4|||  95.2% | 7,990.0 | 25.0 |  0.3% | line.180

4|||   2.0% |   168.2 | 28.8 | 14.9% | line.183

|====================================================================


Notes for table 4:


  This table shows HW performance counter data for the whole program,

    averaged across ranks or threads, as applicable.

  For further explanation, see the "General table notes" below,

    or use:  pat_report -v -O hwpc ...


Table 4:  Program HW Performance Counter Data (limited entries shown)


PE=HIDE


  

==============================================================================

  Total

------------------------------------------------------------------------------

  Thread Time                                     84.260105 secs

  UNHALTED_CORE_CYCLES                      116,375,995,355

  UNHALTED_REFERENCE_CYCLES                 108,063,453,270

  INSTRUCTION_RETIRED                        77,007,487,327

  OFFCORE_RESPONSE_0:ANY_REQUEST:DDR             24,281,189

  OFFCORE_RESPONSE_1:ANY_REQUEST:MCDRAM          78,534,610

  CPU_CLK                             1.38GHz               

  CPU CLK Boost                                        1.08 X

  Retired Inst per Clock                               0.66

  Memory traffic GBytes              0.078G/sec        6.58 GB

  MCDRAM Memory traffic GBytes       0.060G/sec        5.03 GB

  DDR Memory traffic GBytes          0.018G/sec        1.55 GB

==============================================================================


Notes for table 5:


  This table show the average time and number of bytes written to each

    output file, taking the average over the number of ranks that

    wrote to the file.  It also shows the number of write operations,

    and average rates.

  For further explanation, see the "General table notes" below,

    or use:  pat_report -v -O write_stats ...


Table 5:  File Output Stats by Filename (limited entries shown)


      Avg |      Avg |  Write Rate | Number |    Avg | Bytes/ | File Name=!x/^/(proc|sys)/

    Write |    Write | MiBytes/sec |     of | Writes |   Call |  PE=HIDE

Time per |  MiBytes |             | Writer |    per |        |

   Writer |      per |             |  Ranks | Writer |        |

     Rank |   Writer |             |        |   Rank |        |

          |     Rank |             |        |        |        |

|-----------------------------------------------------------------------------

| 0.000761 | 0.000921 |    1.211195 |      1 |   35.0 |  27.60 | stdout

|=============================================================================


Notes for table 6:


  This table shows energy and power usage for the nodes with the

    maximum, mean, and minimum usage, as well as the sum of usage over

    all nodes.

    Energy and power for accelerators is also shown, if applicable.

  For further explanation, see the "General table notes" below,

    or use:  pat_report -v -O program_energy ...


Table 6:

  Program energy and power usage (from Cray PM) (limited entries shown)


   Node |     Node |   Process | PE=HIDE

Energy |    Power |      Time |

    (J) |      (W) |           |

       

14,264 |  169.042 | 84.382329 | Total


Notes for table 7:


  This table shows memory traffic to DDR and MCDRAM for numa nodes,

    taking for each numa node the maximum value across nodes. It also

    shows the balance in memory traffic by showing the top 3 and

    bottom 3 node values.

  For further explanation, see the "General table notes" below,

    or use:  pat_report -v -O mem_bw_mcdram ...


Table 7:  Memory Bandwidth by Numanode (limited entries shown)


  Memory |     DDR |  MCDRAM |    Thread |  Memory | Numanode

Traffic |  Memory |  Memory |      Time | Traffic |  PE=HIDE

  GBytes | Traffic | Traffic |           |  GBytes |

         |  GBytes |  GBytes |           |   / Sec |

|--------------------------------------------------------------

|  421.13 |   99.46 |  321.68 | 84.492077 |    4.98 | numanode.0

|==============================================================


Notes for table 8:


  This table shows total wall clock time for the ranks with the

    maximum, mean, and minimum time, as well as the average across

    ranks.

    It also shows maximum memory usage from /proc/self/numa_maps for

    those ranks, and on average.  The usage is total size of all

    pages, including huge pages, that were actually mapped into

    physical memory from both private and shared memory segments.

  For further explanation, see the "General table notes" below,

    or use:  pat_report -v -O program_time ...


Table 8:  Wall Clock Time, Memory High Water Mark (limited entries shown)


   Process |   Process | PE=[mmm]

      Time |     HiMem |

           | (MiBytes) |

          

84.382329 |     282.2 | Total

|--------------------------------

| 84.616624 |     284.2 | pe.0

| 84.377615 |     281.8 | pe.31

| 84.345358 |     282.2 | pe.49

|================================


========================  Additional details  ========================




General table notes:


    The default notes for a table do not account for the effects

    of additional command-line options, but the detailed notes

    produced by the -v option do account for them.

    

    An imbalance metric in a line is based on values in main threads

    across multiple ranks, or on values across all threads, as applicable.

    

    An imbalance percent in a line is relative to the maximum value

    for that line across ranks or threads, as applicable.

    

Experiment:  samp_cs_time


Sampling interval:  10000 microsecs


Original path to data file:

  /gpfs/mira-home/jkwack/CrayPat_Test/test+pat+9755-3826s/xf-files   (RTS, 64 data files)


Original program:  /gpfs/mira-home/jkwack/CrayPat_Test/test


Instrumented with:  pat_build test


  Option file "apa" contained:

    -Drtenv=PAT_RT_PERFCTR=default_samp

    -Drtenv=PAT_RT_EXPERIMENT=samp_cs_time

    -Drtenv=PAT_RT_SAMPLING_MODE=3

    -g upc

    -g caf

    -g mpi

    -g shmem

    -g syscall

    -g io


Instrumented program:  ./test+pat


Program invocation:  ./test+pat 4096 100


Exit Status:  0 for 64 PEs


Memory pagesize:  4 KiB


Memory hugepagesize:  Not Available


Programming environment:  GNU


Runtime environment variables:

  CRAYPAT_ALPS_COMPONENT=/opt/cray/pe/perftools/7.0.5/sbin/pat_alps

  CRAYPAT_COMPILER_OPTIONS=1

  CRAYPAT_LD_LIBRARY_PATH=/opt/cray/pe/gcc-libs:/opt/cray/gcc-libs:/opt/cray/pe/perftools/7.0.5/lib64

  CRAYPAT_OPTS_EXECUTABLE=sbin/pat-opts

  CRAYPAT_ROOT=/opt/cray/pe/perftools/7.0.5

  CRAYPE_VERSION=2.6.1

  CRAY_LIBSCI_VERSION=19.06.1

  GCC_VERSION=8.3.0

  GNU_VERSION=8.3.0

  LIBSCI_VERSION=19.06.1

  MODULE_VERSION=3.2.11.3

  MODULE_VERSION_STACK=3.2.11.3

  MPICH_ABORT_ON_ERROR=1

  MPICH_DIR=/opt/cray/pe/mpt/7.7.10/gni/mpich-gnu/8.2

  PAT_BUILD_PAPI_BASEDIR=/opt/cray/pe/papi/5.6.0.5

  PAT_REPORT_PRUNE_NAME=_cray$mt_execute_,_cray$mt_start_,__cray_hwpc_,f_cray_hwpc_,cstart,__pat_,pat_region_,PAT_,OMP.slave_loop,slave_entry,_new_slave_entry,_thread_pool_slave_entry,THREAD_POOL_join,__libc_start_main,_start,__start,start_thread,__wrap_,UPC_ADIO_,_upc_,upc_,__caf_,__pgas_,syscall,__device_stub

  PAT_RT_EXPERIMENT=samp_cs_time

  PAT_RT_PERFCTR=default_samp

  PAT_RT_SAMPLING_MODE=3

  PERFTOOLS_VERSION=7.0.5

  PMI_FORK_RANK=0

  PMI_GNI_COOKIE=4202364928:4202430464

  PMI_GNI_DEV_ID=0:0

  PMI_GNI_LOC_ADDR=4978:4978

  PMI_GNI_PTAG=24:25


Report time environment variables:

    CRAYPAT_ROOT=/opt/cray/pe/perftools/7.0.5

    PAT_REPORT_PRUNE_NAME=_cray$mt_execute_,_cray$mt_start_,__cray_hwpc_,f_cray_hwpc_,cstart,__pat_,pat_region_,PAT_,OMP.slave_loop,slave_entry,_new_slave_entry,_thread_pool_slave_entry,THREAD_POOL_join,__libc_start_main,_start,__start,start_thread,__wrap_,UPC_ADIO_,_upc_,upc_,__caf_,__pgas_,syscall,__device_stub


Number of MPI control variables collected:  108


  (To see the list, specify: -s mpi_cvar=show)


Report command line options:  -o test.CrayPat_report.txt


Operating system:

  Linux 4.4.103-6.38_4.0.154-cray_ari_c #1 SMP Fri Aug 30 18:08:26 UTC 2019 (0e37142)


Hardware performance counter events:

   UNHALTED_CORE_CYCLES                   Unhalted core cycles

   UNHALTED_REFERENCE_CYCLES              Unhalted reference cycle

   INSTRUCTION_RETIRED                    Instructions retired (any thread modifier supported in fixed counter)

   OFFCORE_RESPONSE_0:ANY_REQUEST:DDR     Offcore response event (must provide at least one request type andeither any_response or any combination of supplier + snoop):Counts any request:accounts for responses from DDR (local and far)

   OFFCORE_RESPONSE_1:ANY_REQUEST:MCDRAM  Offcore response event (must provide at least one request type andeither any_response or any combination of supplier + snoop):Counts any request:accounts for responses from MCDRAM (local and far)

5. In-depth Analysis: Using Cray Apprentice2

Cray Apprentice2 is a GUI tool for visualizing and manipulating the performance analysis data captured during program execution. After using pat_report to open the initial .xf data file(s) and generate the .ap2 file(s), use Cray Apprentice2 to open and explore the .ap2 file(s) in further detail.

The following is an example on a login node on Theta:

jkwack@thetalogin5:~> module unload darshan
jkwack@thetalogin5:~> module load perftools-base perftools
jkwack@thetalogin5:~> app2
CrayPat-01.png
CrayPat-02.png
CrayPat-03.png
CrayPat-04.png
CrayPat-05.png
CrayPat-06.png
CrayPat-07.png
CrayPat-08.png
CrayPat-09.png