About the Slingshot 11 Upgrade
Polaris's interconnect is being upgraded from Slingshot 10 to Slingshot 11 during maintenance on October 30, November 6, and November 13.
This upgrade will double the maximum NIC bandwidth from 100 Gb/s to 200 Gb/s. The change from a Mellanox-based NIC to a “true” HPE Cassini-based Slingshot NIC will also open the door for future features.
Phased Rollout
We will perform the upgrade in three phases from October 30 to November 13.
- October 30: System 30% upgraded
- November 6: System 60% upgraded
- November 13: System upgrade complete
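For planning purposes, the schedule above can be expressed as a small lookup. This is only an illustrative sketch: the year 2023 is an assumption inferred from the SC23 reference, the function name `ss11_fraction` is hypothetical, and a maintenance day is treated as already upgraded, which simplifies what happens on the day itself.

```python
from datetime import date

# Maintenance dates and the cumulative share of nodes on Slingshot 11 afterwards.
# The year 2023 is an assumption inferred from the SC23 reference.
ROLLOUT = [
    (date(2023, 10, 30), 0.30),
    (date(2023, 11, 6), 0.60),
    (date(2023, 11, 13), 1.00),
]


def ss11_fraction(day: date) -> float:
    """Return the share of nodes expected to be on Slingshot 11 on `day`.

    Simplification: a maintenance day counts as already upgraded.
    """
    fraction = 0.0
    for maintenance_day, upgraded in ROLLOUT:
        if day >= maintenance_day:
            fraction = upgraded
    return fraction


if __name__ == "__main__":
    for d in (date(2023, 10, 29), date(2023, 11, 1), date(2023, 11, 8), date(2023, 11, 20)):
        print(d.isoformat(), f"{ss11_fraction(d):.0%}")
```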
The phased approach will give you four weeks to test your applications using Slingshot 11 while adding minimal additional system downtime. As you test your applications, please report any issues to support@alcf.anl.gov as you find them.
Additional Information to Know
- The maximum job size in the prod queue will be reduced during the transition, as shown in the table here: https://docs.alcf.anl.gov/polaris/running-jobs. Existing job scripts in the prod queue will run as usual (with the exception of the reduced maximum job size).
- 30% of the nodes will be Slingshot 11 after Oct. 30, 60% will be Slingshot 11 after Nov. 6, and all the nodes will be Slingshot 11 after Nov. 13.
- If your job submissions do not account for this change in maximum job size, your jobs could sit in the queue for four weeks with a comment of “insufficient resources”.
- During the upgrade period, there will be a Slingshot 11 queue (called "ss11") for testing and running jobs against the Slingshot 11 nodes. Some existing scripts may need to be updated, and some codes may need to be rebuilt. The job limits are broad so that people can test at scale, with no limits on node count. See this page for queue policies: https://docs.alcf.anl.gov/polaris/running-jobs. Please be respectful of your fellow users and do not monopolize the queue.
- Preemptable and demand queues will be Slingshot 11 after the maintenance on Oct 30.
- Debug and debug-scaling queues will be Slingshot 11 after the maintenance on Nov 13.
- One of the login nodes will be converted to Slingshot 11, which can be used to rebuild code if necessary, though some codes run unchanged. Additional details on required modules for AI frameworks will be provided.
- Code compiled on the ss11 UANs (user access nodes) will not run on the ss10 hosts. Similarly, the ss10 UANs will not be able to compile code that runs on ss11 hosts. Jobs can be submitted from any UAN to any queue, but ss10 codes should be built on ss10 UANs and ss11 codes on ss11 UANs.
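The queue and build rules in the list above can be summarized in a short helper that reports which build (ss10 or ss11) a queue needs on a given date. This is a sketch under stated assumptions, not an ALCF tool: the year 2023 is inferred from the SC23 reference, the function names `build_for_queue` and `can_run` are hypothetical, and the strict matching rule is conservative, since some codes reportedly run unchanged.

```python
from datetime import date

# Date from which each queue runs on Slingshot 11 nodes, per the list above.
# The year 2023 is an assumption inferred from the SC23 reference.
SS11_SINCE = {
    "preemptable": date(2023, 10, 30),
    "demand": date(2023, 10, 30),
    "debug": date(2023, 11, 13),
    "debug-scaling": date(2023, 11, 13),
}
ALL_SS11 = date(2023, 11, 13)  # every node is Slingshot 11 from this date on


def build_for_queue(queue: str, day: date) -> str:
    """Return "ss10" or "ss11": the build a job in `queue` needs on `day`."""
    if queue == "ss11" or day >= ALL_SS11:
        return "ss11"
    switch = SS11_SINCE.get(queue)
    if switch is None:
        # The announcement does not say which fabric other queues use mid-transition.
        raise ValueError(f"fabric for queue {queue!r} on {day} is not stated")
    return "ss11" if day >= switch else "ss10"


def can_run(build: str, host: str) -> bool:
    """Conservative rule: a build only runs on hosts with the same fabric."""
    return build == host


if __name__ == "__main__":
    day = date(2023, 11, 1)
    for queue in ("ss11", "preemptable", "debug"):
        print(queue, "->", build_for_queue(queue, day))
```

Raising an error for queues the announcement does not cover (such as prod before the upgrade completes) avoids guessing which nodes a job would land on.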
Note: The Slingshot 11 upgrade will take place during the month before SC23. Please keep this in mind when submitting your machine reservations.
If you have questions or concerns, please email support@alcf.anl.gov.