
Combined MPI and CUDA parallelization of the CTQMC algorithm. Each GPU is paired with a CPU, which simulates many independent Markov chains. The CPU advances the logic of a single Markov chain until it reaches the dominant computational task: the multiplication of many moderately sized matrices in the trace. This task is handed off to the GPU, while the CPU moves on to the next available Markov chain it controls and advances it until that chain's trace computation is handed off in turn.
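
The hand-off pattern described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses neither MPI nor CUDA, and a single-worker thread pool stands in for the GPU. The names `Chain`, `propose`, `trace_of_product`, and `run`, along with the matrix dimension and the number of matrices per trace, are hypothetical choices made only for illustration. The sketch shows the scheduling idea alone: the CPU-side loop prepares the matrices for one chain, submits the product to the worker, and moves on to the next chain without waiting for the result.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def trace_of_product(matrices):
    """Stand-in for the offloaded task: multiply the matrices in order, return the trace."""
    prod = matrices[0]
    for m in matrices[1:]:
        prod = matmul(prod, m)
    return sum(prod[i][i] for i in range(len(prod)))


class Chain:
    """One independent Markov chain (hypothetical, heavily simplified)."""

    def __init__(self, seed, dim=4, n_mats=6):
        self.rng = random.Random(seed)
        self.dim = dim
        self.n_mats = n_mats
        self.pending = None   # future for a trace currently on the worker
        self.traces = []      # traces received so far

    def propose(self):
        """CPU-side logic: build the matrices whose product must be traced."""
        return [[[self.rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
                 for _ in range(self.dim)]
                for _ in range(self.n_mats)]


def run(n_chains=3, steps=4):
    """Round-robin over chains: collect a finished trace, then hand off the next one."""
    chains = [Chain(seed) for seed in range(n_chains)]
    # Assumption: one worker thread plays the role of the paired GPU.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        active = True
        while active:
            active = False
            for chain in chains:
                if chain.pending is not None:
                    # Only blocks if this chain's trace is not finished yet.
                    chain.traces.append(chain.pending.result())
                    chain.pending = None
                if len(chain.traces) < steps:
                    # Hand off the expensive product and move to the next chain.
                    chain.pending = gpu.submit(trace_of_product, chain.propose())
                    active = True
    return chains
```

In a real CPU/GPU pairing the `submit` call would presumably correspond to an asynchronous kernel launch, with the CPU returning to a chain only once its result is available.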