HPC platforms are moving towards massively parallel architectures, with denser multi-core nodes and less memory per core. Application designers are thus forced to rethink conventional programming models, such as the exclusive use of processes to express all degrees of parallelism, which incurs expensive data movement and large memory consumption. MPI+X, where X often designates a threading programming model such as OpenMP, is becoming a common way to alleviate these issues: data is shared within the same address space, while MPI moves data between processes.
We first demonstrate how an MPI+Threads model naturally fits and helps scale the Breadth-First Search (BFS) algorithm. In addition, we discuss our optimization efforts to reduce algorithmic costs that grow linearly or quadratically with the number of processes, which allowed us to scale our algorithm to 512K cores on a BG/Q system.
We also show that a model that relies on multithreaded concurrent communication can hit performance barriers due to contention in the MPI runtime. We demonstrate that, on hierarchical memory architectures, this contention is strongly affected by unfair arbitration when the MPI runtime relies on a mutex-based critical section. We then provide an in-depth analysis of critical section arbitration in MPICH and its influence on communication performance. We follow with solutions that implement fair arbitration and a custom lock that favors threads doing useful work. Our evaluation with several benchmarks and applications shows up to 4x improvement over the original mutex-based design.
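As a rough illustration of what fair arbitration of a critical section can look like, the sketch below implements a classic ticket lock with C11 atomics: threads enter in the order in which they requested the lock, whereas a plain mutex makes no such guarantee. This is a generic textbook technique, not the MPICH implementation or the custom lock discussed in the talk; all names (ticket_lock_t, ticket_lock_acquire, run_demo, and so on) are hypothetical and chosen only for this example.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical ticket lock: FIFO arbitration among waiting threads. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed to enter */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static void ticket_lock_acquire(ticket_lock_t *l) {
    /* Take a ticket; arrival order decides entry order. */
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    /* Wait until our ticket is served, yielding the CPU while waiting. */
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != my) {
        sched_yield();
    }
}

static void ticket_lock_release(ticket_lock_t *l) {
    /* Pass the lock to the holder of the next ticket. */
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}

/* Small demo: several threads increment a shared counter under the lock. */
typedef struct {
    ticket_lock_t lock;
    long counter;
    int iters;
} demo_state_t;

static void *demo_worker(void *arg) {
    demo_state_t *s = arg;
    for (int i = 0; i < s->iters; i++) {
        ticket_lock_acquire(&s->lock);
        s->counter++;  /* stands in for work done inside the runtime */
        ticket_lock_release(&s->lock);
    }
    return NULL;
}

/* Runs up to 64 threads and returns the final counter value. */
long run_demo(int nthreads, int iters) {
    pthread_t tids[64];
    demo_state_t s;
    if (nthreads > 64) nthreads = 64;
    s.counter = 0;
    s.iters = iters;
    ticket_lock_init(&s.lock);
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tids[t], NULL, demo_worker, &s);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);
    return s.counter;
}
```

Calling run_demo(4, 10000) should return 40000 if mutual exclusion holds. The FIFO order prevents a thread that just released the lock from immediately reacquiring it ahead of threads that have been waiting, which is the kind of unfairness a mutex can exhibit on hierarchical memory systems.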
Bio:
Abdelhalim Amer is a Ph.D. candidate at Tokyo Institute of Technology, under the supervision of Prof. Satoshi Matsuoka. His research focuses on threading models for massively parallel systems. He addresses thread-centric programming challenges such as the trade-off between data locality and parallelism in task-parallel environments, and the interaction between threads and communication runtimes. In particular, he is working with Dr. Pavan Balaji on the interaction between threading models and the MPI communication runtime.