Multilingual Geo-parsing Based on Free Wiki World Map

Bibliographic Details
Main Authors: Huang, Yu-Ling, 黃郁菱
Other Authors: Huang, Chun-Ying
Format: Others
Language: zh-TW
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/spp76q
Description
Summary: Master's thesis === National Taiwan Ocean University (國立臺灣海洋大學) === Department of Computer Science and Engineering === 104 === Retrieving representative geographic locations from texts is an interesting research problem. Researchers have previously tried to geo-tag texts retrieved from sources such as blog posts and tweets. Most of these works tokenize texts using natural language processing techniques and then apply heuristic algorithms to identify geo-locations. However, these studies face two critical challenges: the diversity of languages and the granularity of the identified geo-locations. The former requires language-specific dictionaries or phrase databases, while coarse-grained tagging does not meet users' demand for a more representative location for a given text. In this thesis, we attempt to develop a multilingual geo-tagging approach that addresses both challenges. Unlike previous works, our approach does not rely on natural language processing techniques to process inputs. Instead, we simply tokenize input texts using an N-gram approach and then recognize geo-locations based on crowd-contributed geographic map data. We further improve the granularity of our approach by considering additional geographic phrase features such as the length, the area size, and the relationships between candidate phrases. Based on these features, our approach is able to precisely identify representative locations for input texts in different languages without a dictionary. We evaluate our approach using texts crawled from news websites, and the experimental results show that it achieves 96% and 92% correctness in Chinese and Japanese, respectively.
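
The pipeline outlined in the summary (character N-gram tokenization, lookup against map data, then ranking candidates by phrase length, area size, and relationships between candidates) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the `GAZETTEER` entries and their area values are toy placeholders standing in for crowd-contributed map data, and the weighting in `score` is a hypothetical heuristic, not the author's formula.

```python
from typing import Dict, Iterator, Optional, Tuple

# Hypothetical toy gazetteer: place name -> (area in km^2, parent place or None).
# The entries and numbers are illustrative placeholders, not real map data.
Gazetteer = Dict[str, Tuple[float, Optional[str]]]

GAZETTEER: Gazetteer = {
    "台灣": (36000.0, None),
    "基隆": (130.0, "台灣"),
    "基隆港": (5.0, "基隆"),
}


def char_ngrams(text: str, max_n: int) -> Iterator[str]:
    """Yield every character n-gram of length 1..max_n (no word segmentation)."""
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def find_candidates(text: str, gazetteer: Gazetteer) -> Dict[str, int]:
    """Count how often each gazetteer name occurs among the text's n-grams."""
    max_n = max(len(name) for name in gazetteer)
    counts: Dict[str, int] = {}
    for gram in char_ngrams(text, max_n):
        if gram in gazetteer:
            counts[gram] = counts.get(gram, 0) + 1
    return counts


def score(name: str, counts: Dict[str, int], gazetteer: Gazetteer) -> float:
    """Hypothetical score combining length, area size, and a parent relationship."""
    area, parent = gazetteer[name]
    value = float(counts[name] * len(name))  # frequent, longer phrases score higher
    value += 1.0 / (1.0 + area)              # smaller areas get a slight bonus
    if parent in counts:                     # parent place is also mentioned
        value += 1.0
    return value


def most_representative(text: str, gazetteer: Gazetteer = GAZETTEER) -> Optional[str]:
    """Return the highest-scoring candidate place, or None if nothing matches."""
    counts = find_candidates(text, gazetteer)
    if not counts:
        return None
    return max(counts, key=lambda name: score(name, counts, gazetteer))


if __name__ == "__main__":
    print(most_representative("基隆港位於台灣基隆"))
```

Because the tokenizer only slides a window over characters, the same code runs unchanged on Chinese, Japanese, or any other script. Only the gazetteer needs to contain names in that language.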