Failure analysis and prediction in compute clouds

Compared with supercomputer clusters, most cloud computing clusters are built from less reliable, commercial off-the-shelf components. The high failure rates in their hardware and software components result in frequent node and application failures. Therefore, it is important to understand their failur...

Bibliographic Details
Main Author: Chen, Xin
Language: English
Published: University of British Columbia 2014
Online Access: http://hdl.handle.net/2429/50871
id ndltd-UBC-oai-circle.library.ubc.ca-2429-50871
record_format oai_dc
spelling ndltd-UBC-oai-circle.library.ubc.ca-2429-50871 2018-01-05T17:27:47Z Failure analysis and prediction in compute clouds Chen, Xin Applied Science, Faculty of Electrical and Computer Engineering, Department of Graduate 2014-10-23T22:37:28Z 2014-10-23T22:37:28Z 2014 2014-11 Text Thesis/Dissertation http://hdl.handle.net/2429/50871 eng Attribution-NonCommercial-NoDerivs 2.5 Canada http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ University of British Columbia
collection NDLTD
language English
sources NDLTD
description Compared with supercomputer clusters, most cloud computing clusters are built from less reliable, commercial off-the-shelf components. The high failure rates of their hardware and software components result in frequent node and application failures. It is therefore important to understand these failures in order to design a reliable cloud system. This thesis presents a characterization study of cloud application failures and proposes a method to predict application failures in order to save resources. We first analyze a workload trace from a production cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We observe that there are many opportunities to enhance the reliability of applications running in the cloud, and further find that the resource usage patterns of jobs can be leveraged by failure prediction techniques. Next, we propose a prediction method based on recurrent neural networks to identify failures. It takes resource usage measurements or performance data and generates features to categorize applications into different classes. We then evaluate the method on the cloud workload trace. Our results show that the model is able to predict application failures. Moreover, we explore early classification to identify failures, and find that the prediction algorithm gives the cloud system enough time to take proactive actions well before applications terminate, avoiding resource wastage. Applied Science, Faculty of; Electrical and Computer Engineering, Department of; Graduate
author Chen, Xin
title Failure analysis and prediction in compute clouds