A Resource-Conscious MPI Message Queue Architecture for Very Large-Scale Jobs

Judi Zounvemo
Seminar

To guarantee message queue avoidance without the painful compromise of dropping eager messages, the concept of reception must be entirely absent; that is, messages must always be able to reach their final destination without any action from the receiving process. However, in the Message Passing Interface (MPI), even the one-sided communication model cannot always offer such a guarantee at the implementation level, especially for non-contiguous datatypes or remote accumulation. In fact, message queuing is so frequently used in MPI that some authors have called the message queue its most crucial data structure. Ensuring fast message queues at large scales is therefore of paramount importance. Scalability, however, is two-fold. With growing processor core density per node and the expected lower memory density per core at larger scales, a queue mechanism that is blind to its own memory consumption could be as harmful as one that quickly becomes unacceptably slow as job sizes grow. More importantly, scalability is weakness-bound. If a message queue architecture remains reasonably fast at 10 million CPU cores while its memory consumption becomes prohibitive at 0.1 million CPU cores, then its scalability is effectively limited to 0.1 million CPU cores.

In this talk, I present a multidimensional MPI message queue management mechanism that exploits rank decomposition to considerably mitigate the effects of job size on both speed of operation and memory consumption. I compare the behavior of the proposed design with two reference message queue approaches, each of which performs extremely well with respect to either memory or speed scalability but can quickly become unusable for large jobs because it is not two-fold scalable. I show in particular that the proposed design not only offers unbounded message queue processing speedup but, in certain situations, even outperforms the reference memory-scalable message queue approach in terms of memory footprint.
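
To make the idea of rank decomposition concrete, the following is a minimal sketch and not the design presented in the talk. It assumes that a source rank is split into a fixed number of base-`base` digits and that each digit selects one level of a lazily allocated tree, so a search only scans messages from the requested sender. The names `decompose` and `MultiDimQueue`, and the choice of base and dimension count, are hypothetical.

```python
from collections import deque


def decompose(rank, base, dims):
    """Split a rank into `dims` base-`base` digits, most significant first."""
    if not 0 <= rank < base ** dims:
        raise ValueError("rank out of range for this decomposition")
    coords = []
    for _ in range(dims):
        rank, digit = divmod(rank, base)
        coords.append(digit)
    return tuple(reversed(coords))


class MultiDimQueue:
    """Hypothetical message queue indexed by a decomposed source rank.

    Tree nodes are created only when a sender first enqueues a message,
    so memory grows with the number of senders seen, not with job size.
    """

    def __init__(self, base, dims):
        self.base = base
        self.dims = dims
        self.root = {}

    def enqueue(self, source, tag, payload):
        coords = decompose(source, self.base, self.dims)
        node = self.root
        # Walk (and create if needed) one tree level per leading digit.
        for digit in coords[:-1]:
            node = node.setdefault(digit, {})
        # The last digit selects the per-sender FIFO bucket.
        node.setdefault(coords[-1], deque()).append((tag, payload))

    def dequeue(self, source, tag):
        """Remove and return the oldest payload matching (source, tag), or None."""
        coords = decompose(source, self.base, self.dims)
        node = self.root
        for digit in coords[:-1]:
            node = node.get(digit)
            if node is None:
                return None
        bucket = node.get(coords[-1])
        if bucket is None:
            return None
        # Only messages from this one sender are scanned, in arrival order.
        for i, (t, payload) in enumerate(bucket):
            if t == tag:
                del bucket[i]
                return payload
        return None
```

In this sketch, a lookup costs `dims` dictionary steps plus a scan of a single sender's messages, whatever the total number of queued messages. Wildcard matching such as `MPI_ANY_SOURCE`, communicator contexts, and reclaiming empty buckets are deliberately left out.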