Extending the domain of transparent checkpoint-restart for large-scale HPC
While large-scale HPC systems are critical for expediting progress in many scientific fields, exascale computing will face severe resilience challenges. Checkpoint-restart is an important technology that large-scale applications continue to rely on to make forward progress in the presence of failure...
Published: | |
---|---|
Online Access: | http://hdl.handle.net/2047/D20316244 |