For most applications, profiling and analysis tools can directly measure how subroutines and functions run on the hardware. Deep learning workflows add a layer of complexity because models are commonly written in interpreted languages such as Python and then compiled into a computational graph of operations at run time. Profiling at the hardware level is therefore more indirect, but important information about bottlenecks and inefficiencies can still be extracted by observing how kernels are dispatched and run. Furthermore, profiling information from hardware counters can be correlated with the output of higher-level analysis tools that provide visibility at the model and graph levels.
This presentation will show how to use the Intel VTune Profiler to uncover how your DL model runs on Intel hardware. It will demonstrate the analysis of mini-workflows built with TensorFlow and PyTorch, and show how hardware-level profiles match up with model and graph information from tools like TensorBoard. The focus will be on mini-workflows that you can use for experimentation while learning the rich features of the profiling tools.
Authors
Christopher Lishka (presenting) is a Software Applications Engineer in the Intel Center of Excellence at Argonne National Laboratory. He helps scientists and engineers optimize their code, specializing in the analysis of deep learning workflows with Intel profiling tools. At Intel, he has worked in the Artificial Intelligence Products Group, Compiler Code Generation, Binary Translation, and Wearable Devices.
Nalini Kumar is an Application Engineer in the New Technology Enablement team in the HPC group at Intel. She works on performance analysis and optimization of machine learning applications in HPC environments. In the past, she has collaborated with the Big Data Center at Lawrence Berkeley National Laboratory on scaling Python machine learning codes on Cori. She also works on performance modeling for co-design at Intel.