Parallel applications and programming models increasingly rely on threading to share node resources and cope with growing core density and scarce per-core resources. To achieve high scalability on such systems, exposing massive concurrency is essential to fully utilize the computational resources and to hide communication and data-movement latencies (memory accesses, network communication, I/O, etc.). Unfortunately, kernel threads are heavyweight because of their high scheduling costs. As a result, user-level threading (ULT) runtimes are becoming a popular alternative to kernel threads thanks to their lightweight nature. Despite these advantages, however, ULTs can still incur expensive scheduling costs in extremely fine-grained execution environments. Although a large body of work has been dedicated to ULTs, no lower bounds exist on those scheduling costs in the context of modern multi- and many-core systems. In addition, it is not clear what features a ULT runtime must expose to applications and programming models in order to approach these bounds without losing the required functionality.
In this work, we perform an in-depth characterization of the Argobots ULT runtime and derive lower bounds on context-switching costs, which constitute a large portion of scheduling overheads. In addition, we show that using a subset of Argobots' features that is sufficient for certain execution environments can reduce the scheduling overhead by up to 67% with a minimalistic runtime. Furthermore, we demonstrate how exposing a thread-to-thread context-switching capability to the user can further reduce scheduling costs.