A fast and fault tolerant microkernel-based system for exa-scale computing (FFMK)

Hermann Härtig
Seminar

FFMK is a recently started project funded by DFG's Exascale-Software program. It addresses three key scalability obstacles expected in future exa-scale systems: vulnerability to system failures caused by transient or permanent faults, performance losses due to load imbalances, and noise due to unpredictable interactions between HPC applications and the operating system. To this end, we adapt and integrate well-proven technologies including:

• Microkernel-based operating systems (L4) to eliminate the operating-system noise caused by feature-heavy, all-in-one operating systems and to make kernel influences more deterministic and predictable,
• Erasure-code-protected on-node checkpointing to provide a fast checkpoint and restart mechanism capable of keeping up with a shrinking mean time between failures (MTBF), and
• A mathematically sound management system and load-balancing algorithms (MOSIX) to adapt the system to the highly dynamic and widely varying requirements of today’s and future HPC applications.

FFMK will combine Linux running in a light-weight virtual machine with a special-purpose component for MPI, both running side by side on L4. The objective is to build a fluid, self-organizing platform for applications that require scaling up to exa-scale performance. The talk will explain the assumptions and overall architecture of FFMK and then present a number of design decisions the team is currently facing. FFMK is a cooperation between Hebrew University's MOSIX team, the HPC centers of Berlin and Dresden (ZIB, ZIH) and TU Dresden's operating systems group.

Short Bio:
After receiving his PhD from Karlsruhe University on an SMP-related topic, Hermann Härtig led a team at the German National Research Center (GMD) to build BirliX, a Unix lookalike designed to address high security requirements. He then moved to TU Dresden to lead the operating systems chair. His team was among the pioneers in building microkernels of the L4 family (Fiasco, Nova) and systems based on L4 (L4Re, DROPS, NIZZA). L4Re and Fiasco form the OS basis of the SIMKO 3 smartphone. Hermann Härtig is now the PI for FFMK.