GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources
Master's === National Central University === Department of Civil Engineering === 104 === With the advance of World Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. While geospatial resources are being published at an ever-increasing speed, the "big geospatial data management"...
Main Authors: | Hao Chang, 張皓 |
---|---|
Other Authors: | Chih-Yuan Huang |
Format: | Others |
Language: | en_US |
Published: | 2016 |
Online Access: | http://ndltd.ncl.edu.tw/handle/3vtz8y |
id |
ndltd-TW-104NCU05015073 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-104NCU05015073 2019-05-15T23:01:21Z http://ndltd.ncl.edu.tw/handle/3vtz8y GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources 地理網路爬蟲:具擴充及擴展性之地理網路資源爬行架構 Hao Chang 張皓 Master's National Central University Department of Civil Engineering 104 With the advance of World Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. While geospatial resources are being published at an ever-increasing speed, "big geospatial data management" issues have started to attract attention. Among these issues, this research focuses on discovering distributed geospatial resources. As resources are scattered across the globally distributed WWW, users face difficulties in finding the resources they need. While the WWW has Web search engines addressing web resource discovery, we envision that the geospatial Web (i.e., GeoWeb) also requires GeoWeb search engines for users to efficiently find GeoWeb resources. To realize a GeoWeb search engine, one of the first steps is to proactively discover GeoWeb resources on the WWW. Hence, in this study, we propose the GeoWeb Crawler, an extensible Web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, Keyhole Markup Language (KML) files, and ESRI Shapefiles. In addition, to improve the performance of the GeoWeb Crawler, we apply distributed computing in the framework so that it scales horizontally. Using 8 machines, we achieved a 13-fold performance improvement in the crawling process. Furthermore, while regular web crawlers are ideal for discovering resources through hyperlinks, the GeoWeb Crawler needs customized connectors to find the resources hidden behind open or proprietary web services. 
The results show that for 10 targeted open-standard-based resource types and 3 non-open-standard-based resource types, the GeoWeb Crawler discovered 7,351 geospatial services and 194,003 datasets, 3.8 to 47.5 times more than what users can find with existing approaches. Based on the crawling-level distribution of discovered resources, the results indicate that Google search provides good seeds for discovering resources efficiently. However, the deeper we crawl, the more effort is wasted. Based on the proposed solution, we built a GeoWeb search engine prototype, GeoHub. According to the experimental results, the proposed GeoWeb Crawler framework is shown to be extensible and scalable enough to provide a comprehensive index of the GeoWeb. Chih-Yuan Huang 黃智遠 2016 thesis 48 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Central University === Department of Civil Engineering === 104 === With the advance of World Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. While geospatial resources are being published at an ever-increasing speed, "big geospatial data management" issues have started to attract attention. Among these issues, this research focuses on discovering distributed geospatial resources. As resources are scattered across the globally distributed WWW, users face difficulties in finding the resources they need. While the WWW has Web search engines addressing web resource discovery, we envision that the geospatial Web (i.e., GeoWeb) also requires GeoWeb search engines for users to efficiently find GeoWeb resources. To realize a GeoWeb search engine, one of the first steps is to proactively discover GeoWeb resources on the WWW. Hence, in this study, we propose the GeoWeb Crawler, an extensible Web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, Keyhole Markup Language (KML) files, and ESRI Shapefiles. In addition, to improve the performance of the GeoWeb Crawler, we apply distributed computing in the framework so that it scales horizontally. Using 8 machines, we achieved a 13-fold performance improvement in the crawling process. Furthermore, while regular web crawlers are ideal for discovering resources through hyperlinks, the GeoWeb Crawler needs customized connectors to find the resources hidden behind open or proprietary web services. The results show that for 10 targeted open-standard-based resource types and 3 non-open-standard-based resource types, the GeoWeb Crawler discovered 7,351 geospatial services and 194,003 datasets, 3.8 to 47.5 times more than what users can find with existing approaches. 
Based on the crawling-level distribution of discovered resources, the results indicate that Google search provides good seeds for discovering resources efficiently. However, the deeper we crawl, the more effort is wasted. Based on the proposed solution, we built a GeoWeb search engine prototype, GeoHub. According to the experimental results, the proposed GeoWeb Crawler framework is shown to be extensible and scalable enough to provide a comprehensive index of the GeoWeb.
|
author2 |
Chih-Yuan Huang |
author_facet |
Chih-Yuan Huang Hao Chang 張皓 |
author |
Hao Chang 張皓 |
spellingShingle |
Hao Chang 張皓 GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources |
author_sort |
Hao Chang |
title |
GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources |
title_short |
GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources |
title_full |
GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources |
title_fullStr |
GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources |
title_full_unstemmed |
GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources |
title_sort |
geoweb crawler: an extensible and scalable web crawling framework for discovering geospatial web resources |
publishDate |
2016 |
url |
http://ndltd.ncl.edu.tw/handle/3vtz8y |
work_keys_str_mv |
AT haochang geowebcrawleranextensibleandscalablewebcrawlingframeworkfordiscoveringgeospatialwebresources AT zhānghào geowebcrawleranextensibleandscalablewebcrawlingframeworkfordiscoveringgeospatialwebresources AT haochang delǐwǎnglùpáchóngjùkuòchōngjíkuòzhǎnxìngzhīdelǐwǎnglùzīyuánpáxíngjiàgòu AT zhānghào delǐwǎnglùpáchóngjùkuòchōngjíkuòzhǎnxìngzhīdelǐwǎnglùzīyuánpáxíngjiàgòu |
_version_ |
1719138769208082432 |
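
The abstract describes crawling outward from seed pages, level by level, and recognizing geospatial resources such as KML files, Shapefiles, and OGC web services. The sketch below illustrates that idea only; it is not the thesis's implementation. The function names (`classify`, `crawl`), the URL-based detection rules, the list of service names, and the toy link graph are all assumptions made for illustration, and a real crawler would fetch pages and inspect responses rather than rely on URL patterns.

```python
from collections import deque
from urllib.parse import urlparse, parse_qs

# Assumed set of OGC service identifiers; the record does not list the
# 10 open-standard resource types the thesis actually targets.
OGC_SERVICES = {"WMS", "WFS", "WCS", "WMTS", "CSW", "SOS", "WPS"}


def classify(url):
    """Return a resource label for a URL, or None if it does not look geospatial.

    Hypothetical rule set: file extension for KML/Shapefile, and the
    SERVICE query parameter for OGC key-value-pair requests.
    """
    parsed = urlparse(url)
    path = parsed.path.lower()
    if path.endswith((".kml", ".kmz")):
        return "KML"
    if path.endswith(".shp"):
        return "Shapefile"
    # Query keys are matched case-insensitively (SERVICE, service, Service).
    params = {k.lower(): v for k, v in parse_qs(parsed.query).items()}
    service = params.get("service", [""])[0].upper()
    if service in OGC_SERVICES:
        return "OGC " + service
    return None


def crawl(seeds, get_links, max_level):
    """Breadth-first crawl from the seeds down to max_level.

    get_links(url) returns the outgoing links of a page.
    Returns {resource_url: (label, level_at_which_it_was_found)}.
    """
    found = {}
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, level = queue.popleft()
        label = classify(url)
        if label is not None:
            found[url] = (label, level)
            continue  # a discovered resource is treated as a leaf
        if level == max_level:
            continue  # do not expand pages beyond the level limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return found


# Toy in-memory link graph standing in for fetched pages (all URLs invented).
TOY_WEB = {
    "http://seed.example/": [
        "http://a.example/data.kml",
        "http://b.example/page",
    ],
    "http://b.example/page": [
        "http://c.example/ows?SERVICE=WMS&REQUEST=GetCapabilities",
        "http://d.example/deep",
    ],
    "http://d.example/deep": ["http://e.example/roads.shp"],
}

results = crawl(["http://seed.example/"], lambda u: TOY_WEB.get(u, []), max_level=2)
for url, (label, level) in results.items():
    print(level, label, url)
```

Recording the level at which each resource is found is what makes a crawling-level distribution, like the one the abstract mentions, possible to compute. In this toy graph the Shapefile sits one level past the limit and is never reached, which shows the trade-off between crawl depth and coverage.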