Summary: | In this article, we propose a novel scheme for diagnosing intermittent faults for cloud systems. We have investigated the characteristic of high-level symptomatic behavior on top of a cloud system and identified that (1) arrival counts of high-level symptoms go up with the number of fault injections at different speeds, which may help us to differentiate one fault model from another; (2) the nested level of fatal traps is found to be an indicative of fault duration, which is helpful for fault model diagnosis; (3) fatal traps triggered by certain faulty units is explored, providing useful information for locating faults. Based on these features, an n-dimensional space taking symptom’s arrival rate (grown up skew of the arrival count) as each dimension, which formulates the diagnosis problem as a pattern recognition problem is defined. Then, a backpropagation neural-network-based online hardware fault diagnosis scheme is proposed. Experimental results show that diagnosis accuracy of fault location is 99.2%, the accuracy of fault model is 96.7%, and the latency is affordable. This scheme has been implemented in firmware so that it covers cloud software stacks (virtual machine monitor, virtual machines, and user applications) and incurs zero hardware overhead.
|