GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources

Bibliographic Details
Main Authors: Hao Chang, 張皓
Other Authors: Chih-Yuan Huang
Format: Others
Language: en_US
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/3vtz8y
id ndltd-TW-104NCU05015073
record_format oai_dc
spelling ndltd-TW-104NCU05015073 2019-05-15T23:01:21Z http://ndltd.ncl.edu.tw/handle/3vtz8y GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources 地理網路爬蟲:具擴充及擴展性之地理網路資源爬行架構 Hao Chang 張皓 Master's, National Central University, Department of Civil Engineering, 104 Chih-Yuan Huang 黃智遠 2016 學位論文 ; thesis 48 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Central University === Department of Civil Engineering === 104 === With the advancement of World Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. As geospatial resources are published at an ever-increasing rate, "big geospatial data management" issues have started to attract attention. Among these issues, this research focuses on discovering distributed geospatial resources. Because resources are scattered across the globally distributed WWW, users have difficulty finding the resources they need. Just as the WWW has Web search engines that address web resource discovery, we envision that the geospatial Web (i.e., GeoWeb) also requires GeoWeb search engines so that users can efficiently find GeoWeb resources. One of the first steps toward realizing a GeoWeb search engine is to proactively discover GeoWeb resources on the WWW. Hence, in this study, we propose the GeoWeb Crawler, an extensible Web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, Keyhole Markup Language (KML) files, and ESRI Shapefiles. In addition, to improve the performance of the GeoWeb Crawler, we apply the distributed computing concept in the framework so that it can easily scale horizontally. Using 8 machines, we achieved a 13-fold performance improvement in the crawling process. Furthermore, while regular web crawlers are ideal for discovering resources through hyperlinks, the GeoWeb Crawler requires customized connectors to find resources hidden behind open or proprietary web services. The results show that for 10 targeted open-standard-based resource types and 3 non-open-standard-based resource types, the GeoWeb Crawler discovered 7,351 geospatial services and 194,003 datasets, which is 3.8 to 47.5 times more than what users can find with existing approaches. Based on the crawling-level distribution of discovered resources, the results indicate that Google search provides good seeds for discovering resources efficiently. However, the deeper we crawl, the more unnecessary effort we spend. Based on the proposed solution, we built a GeoWeb search engine prototype, GeoHub. According to the experimental results, the proposed GeoWeb Crawler framework is proven to be extensible and scalable, providing a comprehensive index of the GeoWeb.
author2 Chih-Yuan Huang
author_facet Chih-Yuan Huang
Hao Chang
張皓
author Hao Chang
張皓
spellingShingle Hao Chang
張皓
GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
author_sort Hao Chang
title GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
title_short GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
title_full GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
title_fullStr GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
title_full_unstemmed GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
title_sort geoweb crawler: an extensible and scalable web crawling framework for discovering geospatial web resources
publishDate 2016
url http://ndltd.ncl.edu.tw/handle/3vtz8y
work_keys_str_mv AT haochang geowebcrawleranextensibleandscalablewebcrawlingframeworkfordiscoveringgeospatialwebresources
AT zhānghào geowebcrawleranextensibleandscalablewebcrawlingframeworkfordiscoveringgeospatialwebresources
AT haochang delǐwǎnglùpáchóngjùkuòchōngjíkuòzhǎnxìngzhīdelǐwǎnglùzīyuánpáxíngjiàgòu
AT zhānghào delǐwǎnglùpáchóngjùkuòchōngjíkuòzhǎnxìngzhīdelǐwǎnglùzīyuánpáxíngjiàgòu
_version_ 1719138769208082432