In this webinar, we will cover three main topics to help researchers get started with ThetaGPU: (1) compiling and running, (2) profiling and performance analysis, and (3) AI and frameworks.
Topic 1: Compiling and Running
This session lays out the information new users of the ThetaGPU supercomputer (an NVIDIA DGX A100 machine) need, from environment setup through compilation to running a simulation and/or machine learning code as a job. We will provide an overview of the hardware and pre-installed software libraries, including the NVIDIA A100 GPUs, the compute, service, and login nodes, the Cobalt job scheduler, and the environment module system.
Topic 2: Profiling and Performance Tools
This session will cover NVIDIA's Nsight Systems and Nsight Compute tools. Nsight Systems gives developers a system-wide visualization of an application's performance, which helps them identify and address bottlenecks so that the application scales efficiently across the CPUs and GPUs on ThetaGPU. Nsight Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging through both a user interface and a command-line tool. Step-by-step guides for Nsight Systems and Nsight Compute will be presented, along with a quick demo on ThetaGPU.
Topic 3: AI and Frameworks
This session will summarize the software available on ThetaGPU for machine learning, including conda, containers, scaling software, and performance tools.
About the Speakers
Kyle Gerard Felker is an Assistant Computational Scientist at Argonne National Laboratory in the Computational Science Division and is a member of the ALCF Catalyst team. Kyle currently works on applying deep learning to fusion energy science as a part of the Aurora Early Science Program. Previously, he held a postdoctoral appointment in the Leadership Computing Facility. He received his Ph.D. in Applied and Computational Mathematics from Princeton University, where he worked on numerical methods for astrophysics and helped develop the Athena++ astrophysics code. Kyle was a Department of Energy Computational Science Graduate Fellow (CSGF) from 2014 to 2018. He holds a B.A. in Physics and Mathematics from the University of Chicago.
JaeHyuk Kwack is a member of the performance engineering group at the Argonne Leadership Computing Facility. He received his B.S. and M.S. in engineering from Seoul National University, South Korea, and a Ph.D. in computational mechanics from the University of Illinois at Urbana-Champaign, USA. Before joining Argonne, he worked on the Blue Waters supercomputing project at the National Center for Supercomputing Applications. At Argonne since 2018, he has been working on performance tools and ensuring the readiness of several major scientific applications for performant use on the U.S. Department of Energy’s (DOE) forthcoming exascale system.
Corey Adams is an Assistant Computer Scientist with a joint appointment in the ALCF and Argonne's Physics Division. His research interests include fundamental physics, scalable deep learning, and high-performance Python and data science software. Currently, he works on scaling algorithms for fundamental physics research with machine learning, including segmentation of high-resolution particle physics datasets, sparse convolutional networks for 3D datasets, and surrogate models for nuclear theory calculations. He received bachelor’s degrees in both physics and mathematics from the University of Rochester (2011) and his Ph.D. in physics from Yale University (2016). After a postdoc at Harvard, stationed at Fermi National Accelerator Laboratory, he joined the data science group at the ALCF in 2018. He has held a joint appointment with the Physics Division since 2019.