A Diskless Checkpointing Algorithm for Cluster Architectures Applied to Geospatial Raster Data Processing

Bibliographic Details
Main Authors: Xiaodong Song, Wanfeng Dou, Guoan Tang, Kun Yang, Kejian Qian
Format: Article
Language: English
Published: SAGE Publishing 2014-12-01
Series: Journal of Algorithms & Computational Technology
Online Access: https://doi.org/10.1260/1748-3018.8.4.369
Description
Summary: In recent years, the growing computational demands of massive spatial data analysis have made parallel computing on high-performance computers an inevitable trend in geospatial raster data processing, such as digital terrain analysis (DTA), remote sensing interpretation and digital soil mapping. A key problem is how to design fault-tolerant software that enhances the stability and robustness of scientific applications. This paper presents a failure recovery approach for distributed-memory parallel computing. We adopt the master/slave programming model and present a redundant master framework in which a failure on the master node does not lead to a breakdown of the whole system. The approach schedules a failing task by dividing all the failing data into several partitions according to the computational scale of the failure. By means of the Fault-Tolerant Granularity Model, the scheduling algorithm can assign the failing task dynamically. Finally, taking digital terrain analysis as an example, two experiments based on the data size and the number of failures are discussed. Simulation results indicate that the proposed scheduling algorithm based on the Fault-Tolerant Granularity Model achieves lower fault tolerance overhead than the rollback recovery scheme when several processors fail simultaneously.
ISSN: 1748-3018, 1748-3026