Extending the domain of transparent checkpoint-restart for large-scale HPC

While large-scale HPC systems are critical for expediting progress in many scientific fields, exascale computing will face severe resilience challenges. Checkpoint-restart is an important technology that large-scale applications continue to rely on to make forward progress in the presence of failure...

Bibliographic Details
Online Access: http://hdl.handle.net/2047/D20316244