The Design of Fault Tolerance of Cluster Computing Platform

Master's === National Sun Yat-sen University (國立中山大學) === Department of Computer Science and Engineering (資訊工程學系研究所) === 100 === When nodes fail in a distributed application service, handling the missing results is costly and places additional load on the scheduler. To avoid recalculating the entire result set when a fault occurs, it will be recalculate...

Bibliographic Details
Main Authors: Yu-tien Liao, 廖榆恬
Other Authors: Chun-Hung Lin
Format: Others
Language: zh-TW
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/62638831726857133954
id ndltd-TW-100NSYS5392057
record_format oai_dc
spelling ndltd-TW-100NSYS5392057 2015-10-13T21:22:19Z http://ndltd.ncl.edu.tw/handle/62638831726857133954 The Design of Fault Tolerance of Cluster Computing Platform 叢集計算之容錯設計 Yu-tien Liao 廖榆恬 碩士 國立中山大學 資訊工程學系研究所 100 Chun-Hung Lin 林俊宏 2012 學位論文 ; thesis 70 zh-TW
collection NDLTD
sources NDLTD
description Master's === National Sun Yat-sen University (國立中山大學) === Department of Computer Science and Engineering (資訊工程學系研究所) === 100 === When nodes fail in a distributed application service, handling the missing results is costly and places additional load on the scheduler. To avoid recalculating the entire result set when a fault occurs, only the failed nodes' data is recalculated, on backup machines. This thesis therefore experiments with three methods (N + N nodes, N + 1 nodes, and N + 1 nodes with probability) and analyzes their pros and cons. The third method assigns each job a weight before dispatch and converts that weight into a probability and a nice value (as defined by SLURM [1]) to influence the order in which the scheduler runs jobs. When a fault occurs, the results computed on the healthy nodes are returned to the control node, and the failed node's jobs are then either reassigned or not reassigned to a backup machine to obtain complete results. Finally, the strengths and weaknesses of the three methods are analyzed.
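The abstract does not say how a job's weight is turned into a probability and a nice value, so the following Python sketch is only one plausible reading, not the thesis's actual procedure. It assumes that probabilities are the weights normalized to sum to 1, and that a higher probability maps linearly to a lower nice value (in SLURM, a larger nice value lowers a job's scheduling priority). The function names and the `max_nice` bound are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple


def weights_to_probabilities(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize job weights so that they sum to 1 (assumed mapping)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {job: w / total for job, w in weights.items()}


def probability_to_nice(p: float, max_nice: int = 100) -> int:
    """Map a probability in [0, 1] to a nice value in [0, max_nice].

    Assumption: a higher probability should yield a lower nice value,
    because a lower nice value means a higher scheduling priority.
    """
    return round((1.0 - p) * max_nice)


def plan_jobs(
    weights: Dict[str, float], max_nice: int = 100
) -> Tuple[Dict[str, float], Dict[str, int], List[str]]:
    """Return per-job probabilities, nice values, and the implied job order."""
    probs = weights_to_probabilities(weights)
    nice = {job: probability_to_nice(p, max_nice) for job, p in probs.items()}
    # Jobs with the lowest nice value come first; ties are broken by name.
    order = sorted(nice, key=lambda job: (nice[job], job))
    return probs, nice, order
```

Under these assumptions, a job with weight 3 next to a job with weight 1 receives the lower nice value and is placed first. The resulting values could then be passed to the scheduler at submission time, for example through SLURM's `--nice` option.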