Towards a resilience investigation framework for high performance computing

As large-scale scientific computing platforms increase in size and capability, their complexity also grows. These systems require great care and attention, much of which is due to the rise in failures from increased node/component counts. Fault tolerance, or resilience, is a key challenge for compu...

Bibliographic Details
Main Author: Naughton, Thomas J.
Published: University of Reading 2014
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.658011