Failure analysis and prediction in compute clouds

Most cloud computing clusters are built from unreliable, commercial off-the-shelf components compared with supercomputer clusters. The high failure rates in their hardware and software components result in frequent node and application failures. Therefore, it is important to understand their failur...

Bibliographic Details
Main Author:	Chen, Xin
Language:	English
Published:	University of British Columbia 2014
Online Access:	http://hdl.handle.net/2429/50871

id	ndltd-UBC-oai-circle.library.ubc.ca-2429-50871
record_format	oai_dc
spelling	ndltd-UBC-oai-circle.library.ubc.ca-2429-50871 2018-01-05T17:27:47Z Failure analysis and prediction in compute clouds Chen, Xin Most cloud computing clusters are built from unreliable, commercial off-the-shelf components compared with supercomputer clusters. The high failure rates in their hardware and software components result in frequent node and application failures. Therefore, it is important to understand their failures to design a reliable cloud system. This thesis presents a characterization study of cloud application failures, and proposes a method to predict application failures in order to save resources. We first analyze a workload trace from a production cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We observe that there are many opportunities to enhance the reliability of the applications running in the cloud, and further find that resource usage patterns of the jobs can be leveraged by failure prediction techniques. Next, we propose a prediction method based on recurrent neural networks to identify the failures. It takes the resource usage measurements or performance data, and generates features to categorize the applications into different classes. We then evaluate the method on the cloud workload trace. Our results show that the model is able to predict application failures. Moreover, we explore early classification to identify failures, and find that the prediction algorithm provides the cloud system enough time to take proactive actions much earlier than the termination of applications to avoid resource wastage.
Applied Science, Faculty of Electrical and Computer Engineering, Department of Graduate 2014-10-23T22:37:28Z 2014-10-23T22:37:28Z 2014 2014-11 Text Thesis/Dissertation http://hdl.handle.net/2429/50871 eng Attribution-NonCommercial-NoDerivs 2.5 Canada http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ University of British Columbia
collection	NDLTD
language	English
sources	NDLTD
description	Most cloud computing clusters are built from unreliable, commercial off-the-shelf components compared with supercomputer clusters. The high failure rates in their hardware and software components result in frequent node and application failures. Therefore, it is important to understand their failures to design a reliable cloud system. This thesis presents a characterization study of cloud application failures, and proposes a method to predict application failures in order to save resources. We first analyze a workload trace from a production cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We observe that there are many opportunities to enhance the reliability of the applications running in the cloud, and further find that resource usage patterns of the jobs can be leveraged by failure prediction techniques. Next, we propose a prediction method based on recurrent neural networks to identify the failures. It takes the resource usage measurements or performance data, and generates features to categorize the applications into different classes. We then evaluate the method on the cloud workload trace. Our results show that the model is able to predict application failures. Moreover, we explore early classification to identify failures, and find that the prediction algorithm provides the cloud system enough time to take proactive actions much earlier than the termination of applications to avoid resource wastage. === Applied Science, Faculty of === Electrical and Computer Engineering, Department of === Graduate
author	Chen, Xin
spellingShingle	Chen, Xin Failure analysis and prediction in compute clouds
author_facet	Chen, Xin
author_sort	Chen, Xin
title	Failure analysis and prediction in compute clouds
title_short	Failure analysis and prediction in compute clouds
title_full	Failure analysis and prediction in compute clouds
title_fullStr	Failure analysis and prediction in compute clouds
title_full_unstemmed	Failure analysis and prediction in compute clouds
title_sort	failure analysis and prediction in compute clouds
publisher	University of British Columbia
publishDate	2014
url	http://hdl.handle.net/2429/50871
work_keys_str_mv	AT chenxin failureanalysisandpredictionincomputeclouds
_version_	1718584507268857856

Similar Items