Major OS Update on Theta

July 22 Update on OS Upgrade 

The OS upgrade has been completed, and Theta was released back to users on July 22.

Please note that some of the programming environment (PE) changes previously announced have been deferred to July 27 (see PE completion notes from July 27 here).

[Completion Update]

Original notification below.

===================

 

Major OS Upgrade

To make newer software versions available, to improve security and reliability, and to help pave the way for Theta to connect with the coming global filesystem, we will be updating Theta’s major OS versions from 12:00 CDT on Monday, July 13 through Wednesday, July 22, 2020.

We will upgrade ALCF-Theta from its current SLES12/CNL6 configuration to SLES15/CNL7. This necessary enhancement will be the last major OS version update of Theta. We will also be upgrading Theta’s Sonexion Lustre filesystem.

This will require some user codes to be rebuilt and/or relinked.

Users should note that:

  • Some user codes will need to be rebuilt and/or relinked.
  • Statically linked applications should not be affected, unless they depend on the OS in some non-obvious way. The default will remain static linking.
  • Hugepages of various sizes will be available as loadable modules, with none loaded by default.
  • Codes that run against the current Cray Programming Environment and load its modules for software should not be impacted, assuming the software and module versions are still available.
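
To see which of the cases above applies to an existing binary, one option is to inspect it with the standard `file` utility. The following is a minimal sketch, not an ALCF-provided tool: `classify_linkage` and `linkage_of` are hypothetical helper names, `./my_app` is a placeholder path, and the matching assumes the usual "statically linked" / "dynamically linked" wording in the output of `file`.

```shell
#!/bin/bash
# classify_linkage: take a description string as printed by `file`
# and report whether it describes a static or a dynamic executable.
classify_linkage() {
    case "$1" in
        *"statically linked"*)  echo static ;;
        *"dynamically linked"*) echo dynamic ;;
        *)                      echo unknown ;;
    esac
}

# linkage_of: run `file` in brief mode on a binary and classify the result.
linkage_of() {
    classify_linkage "$(file -b "$1")"
}

# Usage (placeholder path):
#   linkage_of ./my_app
```

A binary reported as dynamic is the more likely candidate for a rebuild or relink after the upgrade; a static one may still be affected if it depends on the OS in a non-obvious way.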

Improvements

  • Newer versions of system software will offer better support for user software.
  • Newer compiler versions will become available.
  • New builds will be valid through Theta's end-of-life.
  • Default routing has changed from Adaptive-0 to Adaptive-3 to reduce congestion effects.

Known impacts and remediations

We are taking measures to minimize impacts and ensure continuity of available software. The current list of known impacts and remediations follows. (Note: some of the programming environment changes previously announced have been deferred.)

  • Cray will no longer be supporting HDF5 1.8.x versions. We will deploy our own build of 1.8.16. 
  • Intel compilers and software products are not expected to be affected.
  • Python support under Intel Python should be minimally affected, as should conda or user-provided Pythons. However, if any of a code's underlying dependencies rely on packages from the Cray programming environment or on specific versions of system libraries, they *may* be affected. The upgraded OS will support both Python 2.7.17 and 3.6.10.
  • The upgrade will require a clean wipe of the current programming environment, including spack-built software.
  • Many user codes will need to be rebuilt and/or relinked against the newer versions of the programming environment and spack-provided dependencies.
  • Some older versions of Cray-built packages will no longer be available. In some instances this may require migrating to a newer version of a dependency, or sanctioning a spack-built replacement for the Cray package.
  • Summary of Cray software versions that will NOT be available after the upgrade. Please contact support@alcf.anl.gov if any of these are essential for you:
    • ATP 2.1.0, 2.1.1, 2.1.2, 2.1.3
    • CCE 8.5.*, 8.7.9, 9.0.2
    • FFTW 3.3.4.11, 3.3.8.1, 3.3.8.2, 3.3.8.3
    • GA 5.3.0.7-9
    • HDF5 1.10.0, 1.10.2.0, 1.10.5.1
    • LGDB 3.0.5-7
    • LibSci 16.*, 17.*, 18.*
    • MPT 7.5.x, 7.6.x, 7.7.6, 7.7.2-4
    • NetCDF 4.4.*, 4.6.1.3, 4.6.3.1
    • PAPI 5.5.*, 5.6.*
    • Perftools 6.5.*, 7.0.*, 7.1.1
    • PETSC 3.7.*, 3.8.*, 3.9.*
    • STAT 2.*, 3.0.*
    • TPSL 17.*, 18.*
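
To check build scripts against the list above, the entries can be transcribed into shell `case` patterns. This is only an illustrative sketch: `removed_after_upgrade` is a hypothetical helper, the product names are used as written in the list (the actual module names on Theta may differ), and the range-style entries (GA 5.3.0.7-9, LGDB 3.0.5-7, MPT 7.7.2-4) are left out because their exact expansion is not spelled out here.

```shell
#!/bin/bash
# removed_after_upgrade PRODUCT VERSION
# Returns 0 if PRODUCT/VERSION matches an entry transcribed from the list
# in this notice, 1 otherwise. Range-style entries are intentionally omitted.
removed_after_upgrade() {
    case "$1/$2" in
        ATP/2.1.[0-3])                                   return 0 ;;
        CCE/8.5.*|CCE/8.7.9|CCE/9.0.2)                   return 0 ;;
        FFTW/3.3.4.11|FFTW/3.3.8.[1-3])                  return 0 ;;
        HDF5/1.10.0|HDF5/1.10.2.0|HDF5/1.10.5.1)         return 0 ;;
        LibSci/1[6-8].*)                                 return 0 ;;
        MPT/7.5.*|MPT/7.6.*|MPT/7.7.6)                   return 0 ;;
        NetCDF/4.4.*|NetCDF/4.6.1.3|NetCDF/4.6.3.1)      return 0 ;;
        PAPI/5.5.*|PAPI/5.6.*)                           return 0 ;;
        Perftools/6.5.*|Perftools/7.0.*|Perftools/7.1.1) return 0 ;;
        PETSC/3.[7-9].*)                                 return 0 ;;
        STAT/2.*|STAT/3.0.*)                             return 0 ;;
        TPSL/1[7-8].*)                                   return 0 ;;
        *)                                               return 1 ;;
    esac
}

# Example inputs (hypothetical): the versions a build script depends on.
for entry in "CCE 9.0.2" "FFTW 3.3.8.4"; do
    product=${entry% *}     # text before the space
    version=${entry#* }     # text after the space
    if removed_after_upgrade "$product" "$version"; then
        echo "$entry: listed as removed, plan to migrate"
    else
        echo "$entry: not matched by the transcribed list"
    fi
done
```

A "not matched" result only means the version is absent from this transcription; it does not confirm that the version is installed after the upgrade.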

Here is the before-vs-after diff of the OS upgrade.

As much as is practical, we are encouraging users to migrate to newer versions. Our users are a tremendous resource and we are reaching out for feedback during this process so that we can address and test migration issues in a timely manner.

Changes to adaptive routing

The adaptive routing has been changed from "Adaptive 0" to "Adaptive 3". This change should yield an overall performance improvement for applications, especially those that are sensitive to network latency. Although we don't suggest it, any application may return to the previous default, ADAPTIVE_0, by unloading the adaptive-routing-a3 module.

module unload adaptive-routing-a3
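
When comparing runs with and without the module, it can help to record which routing mode the environment implies for each run. The sketch below is an assumption-laden illustration: `routing_mode_from_modules` is a hypothetical helper, and it simply treats the presence of adaptive-routing-a3 in the `module list` output as ADAPTIVE_3 and its absence as ADAPTIVE_0, as described in this notice. Other settings that might override the routing mode are not considered.

```shell
#!/bin/bash
# routing_mode_from_modules: given the text printed by `module list`,
# report the routing mode implied by this notice's description.
routing_mode_from_modules() {
    case "$1" in
        *adaptive-routing-a3*) echo "ADAPTIVE_3" ;;
        *)                     echo "ADAPTIVE_0" ;;
    esac
}

if command -v module >/dev/null 2>&1; then
    # `module list` commonly prints to stderr, so capture both streams.
    loaded=$(module list 2>&1)
    echo "Routing mode implied by loaded modules: $(routing_mode_from_modules "$loaded")"
else
    echo "module command not found; run this from a shell where modules are available"
fi
```

Writing this line into each job's output makes it easier to attribute a performance difference to the routing change when sending feedback.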

We appreciate any feedback on this change and how it may have impacted your application performance. Theta (Cray XC40) uses packet-level adaptive routing†, which sends packets across the network along paths that can avoid congested links. This balances the network load across the available paths and thereby sustains high network utilization even under heavy network load.

Adaptive routing on Aries comes in four flavors, which differ in the weighting (bias) given to minimal vs. nonminimal paths. The default adaptive routing used so far on Theta has been ADAPTIVE_0, which has no bias towards minimal or nonminimal paths. Our recent research found that ADAPTIVE_3, which has a strong bias towards minimal routing, is optimal for the majority of workloads as well as for overall system-level congestion management; hence a recommendation was made to switch the default routing mode on Theta to ADAPTIVE_3.

https://www.cray.com/sites/default/files/resources/CrayXCNetwork.pdf