Summary: | As Graphics Processing Units (GPUs) become more pervasive in High Performance Computing (HPC) and safety-critical domains, ensuring that GPU applications can be protected from data corruption grows in importance. The scaling of transistor sizes, operating voltages, and design margins has exacerbated the susceptibility of GPU devices to soft errors. Due to the random nature of these faults, predicting whether they will alter program output is a challenging problem.
Traditional methods of reliability estimation involve thousands of random fault injections per application: in each injection, one random bit used by the program is flipped during execution and the outcome is recorded. Despite prior efforts to mitigate errors, we still lack a clear understanding of how resilient these applications are in the presence of transient faults. As a consequence, the same level of fault protection is typically employed for all applications, regardless of their
resilience. This adversely impacts the performance of GPU applications, often unnecessarily. Moreover, existing fault protection schemes cannot be tailored to the reliability requirements of individual applications. In this thesis, we address these limitations of prior work by building frameworks that can predict and mitigate faults in GPU applications. We develop a toolset that identifies micro-architecture-agnostic characteristics of programs, based on the hardware
resources they stress during execution. Our study first aims to understand error propagation characteristics when faults are injected into scalar versus vector instructions. We then extend this study to build a more sophisticated learning-based framework, called PRISM, which enables us to predict failures in applications without running exhaustive fault injection campaigns, thereby reducing the error estimation effort. We leverage the insights provided by PRISM to develop a framework
called ArmorAll, a lightweight, intelligent, adaptive, and portable software solution to protect GPUs against soft errors. ArmorAll consists of a set of purely compiler-based redundancy schemes designed to optimize instruction duplication on GPUs, thereby enabling more reliable execution. The choice of scheme determines the subset of instructions that must be duplicated in an application, allowing fault coverage to be adapted to different applications.
|