Towards a resilience investigation framework for high performance computing
As large-scale scientific computing platforms increase in size and capability, their complexity also grows. These systems require great care and attention, much of which is due to the rise in failures from increased node/component counts. Fault tolerance, or resilience, is a key challenge for compu...
Main Author: | |
---|---|
Published: | University of Reading, 2014 |
Subjects: | |
Online Access: | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.658011 |