Automatic Scaling Hadoop in the Cloud for Efficient Process of Big Geospatial Data

Efficient processing of big geospatial data is crucial for tackling global and regional challenges such as climate change and natural disasters, but it is challenging not only due to the massive data volume but also due to the intrinsic complexity and high dimensions of the geospatial datasets. Whil...

Bibliographic Details
Main Authors: Zhenlong Li, Chaowei Yang, Kai Liu, Fei Hu, Baoxuan Jin
Format: Article
Language: English
Published: MDPI AG 2016-09-01
Series: ISPRS International Journal of Geo-Information
Subjects: geoprocessing; cloud computing; big data; geospatial cyberinfrastructure; Hadoop
Online Access: http://www.mdpi.com/2220-9964/5/10/173
id doaj-8904e4e3cce9405e913fe54f6b9b5a8b
record_format Article
spelling doaj-8904e4e3cce9405e913fe54f6b9b5a8b
2020-11-24T23:48:49Z
eng
MDPI AG
ISPRS International Journal of Geo-Information
ISSN 2220-9964
2016-09-01
Volume 5, Issue 10, Article 173
DOI 10.3390/ijgi5100173
ijgi5100173
Automatic Scaling Hadoop in the Cloud for Efficient Process of Big Geospatial Data
Zhenlong Li (Department of Geography, University of South Carolina, Columbia, SC 29208, USA)
Chaowei Yang (Spatiotemporal Innovation Center, George Mason University, Fairfax, VA 22030, USA)
Kai Liu (Spatiotemporal Innovation Center, George Mason University, Fairfax, VA 22030, USA)
Fei Hu (Spatiotemporal Innovation Center, George Mason University, Fairfax, VA 22030, USA)
Baoxuan Jin (Yunnan Provincial Geomatics Center, Kunming 650034, China)
http://www.mdpi.com/2220-9964/5/10/173
geoprocessing
cloud computing
big data
geospatial cyberinfrastructure
Hadoop
collection DOAJ
language English
format Article
sources DOAJ
author Zhenlong Li
Chaowei Yang
Kai Liu
Fei Hu
Baoxuan Jin
author_sort Zhenlong Li
title Automatic Scaling Hadoop in the Cloud for Efficient Process of Big Geospatial Data
publisher MDPI AG
series ISPRS International Journal of Geo-Information
issn 2220-9964
publishDate 2016-09-01
description Efficient processing of big geospatial data is crucial for tackling global and regional challenges such as climate change and natural disasters, but it is challenging not only due to the massive data volume but also due to the intrinsic complexity and high dimensions of the geospatial datasets. While traditional computing infrastructure does not scale well with the rapidly increasing data volume, Hadoop has attracted increasing attention in geoscience communities for handling big geospatial data. Recently, many studies were carried out to investigate adopting Hadoop for processing big geospatial data, but how to adjust the computing resources to efficiently handle the dynamic geoprocessing workload was barely explored. To bridge this gap, we propose a novel framework to automatically scale the Hadoop cluster in the cloud environment to allocate the right amount of computing resources based on the dynamic geoprocessing workload. The framework and auto-scaling algorithms are introduced, and a prototype system was developed to demonstrate the feasibility and efficiency of the proposed scaling mechanism using Digital Elevation Model (DEM) interpolation as an example. Experimental results show that this auto-scaling framework could (1) significantly reduce the computing resource utilization (by 80% in our example) while delivering similar performance as a full-powered cluster; and (2) effectively handle the spike processing workload by automatically increasing the computing resources to ensure the processing is finished within an acceptable time. Such an auto-scaling approach provides a valuable reference to optimize the performance of geospatial applications to address data- and computational-intensity challenges in GIScience in a more cost-efficient manner.
topic geoprocessing
cloud computing
big data
geospatial cyberinfrastructure
Hadoop
url http://www.mdpi.com/2220-9964/5/10/173