Clustering of Template Page for Data Extraction
Master's === National Central University === In-service Master Program, Department of Computer Science and Information Engineering === 106 === In the field of web data extraction, the diversity of web content and the complexity of web page structure have long made it challenging to extract data automatically from web pages built on different templates. The web data ex...
Main Authors: | Jia-Ru Wu, 吳佳儒 |
---|---|
Other Authors: | Chia-Hui Chang 張嘉惠 |
Format: | Others |
Language: | zh-TW |
Published: | 2018 |
Online Access: | http://ndltd.ncl.edu.tw/handle/kq9cdn |
id |
ndltd-TW-106NCU05392074 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-106NCU05392074 2019-11-28T05:22:16Z http://ndltd.ncl.edu.tw/handle/kq9cdn Clustering of Template Page for Data Extraction 樣板網頁結構自動分群 Jia-Ru Wu 吳佳儒 Master's National Central University In-service Master Program, Department of Computer Science and Information Engineering 106 In the field of web data extraction, the diversity of web content and the complexity of web page structure have long made it challenging to extract data automatically from web pages built on different templates. Web data extraction systems fall mainly into two categories: record-level and page-level. Both take web pages of the same template as input and use them for data extraction and schema induction. Research on clustering web pages of different templates is rare. This paper proposes a method that clusters web pages automatically by the similarity of their structure, which simplifies the problem of extracting data from pages of different templates. We apply both unsupervised and supervised clustering based on our designed features and compare the performance of the two. Although the overall clustering performance is not as good as expected, unsupervised clustering reaches a precision of 99% and a recall of approximately 78% for the target cluster, while supervised clustering reaches a precision of 97% and a recall of more than 80%. Finally, with this clustering result, we can generate a complete web page schema and extract POI-related information via the page-level information extraction system UWIDE. The extracted information can also be accumulated in databases to enhance the efficiency and quality of related value-added services. Chia-Hui Chang 張嘉惠 2018 學位論文 ; thesis 42 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Central University === In-service Master Program, Department of Computer Science and Information Engineering === 106 === In the field of web data extraction, the diversity of web content and the complexity of web page structure have long made it challenging to extract data automatically from web pages built on different templates. Web data extraction systems fall mainly into two categories: record-level and page-level. Both take web pages of the same template as input and use them for data extraction and schema induction. Research on clustering web pages of different templates is rare.
This paper proposes a method that clusters web pages automatically by the similarity of their structure, which simplifies the problem of extracting data from pages of different templates. We apply both unsupervised and supervised clustering based on our designed features and compare the performance of the two. Although the overall clustering performance is not as good as expected, unsupervised clustering reaches a precision of 99% and a recall of approximately 78% for the target cluster, while supervised clustering reaches a precision of 97% and a recall of more than 80%.
Finally, with this clustering result, we can generate a complete web page schema and extract POI-related information via the page-level information extraction system UWIDE. The extracted information can also be accumulated in databases to enhance the efficiency and quality of related value-added services.
|
author2 |
Chia-Hui Chang |
author_facet |
Chia-Hui Chang Jia-Ru Wu 吳佳儒 |
author |
Jia-Ru Wu 吳佳儒 |
spellingShingle |
Jia-Ru Wu 吳佳儒 Clustering of Template Page for Data Extraction |
author_sort |
Jia-Ru Wu |
title |
Clustering of Template Page for Data Extraction |
title_short |
Clustering of Template Page for Data Extraction |
title_full |
Clustering of Template Page for Data Extraction |
title_fullStr |
Clustering of Template Page for Data Extraction |
title_full_unstemmed |
Clustering of Template Page for Data Extraction |
title_sort |
clustering of template page for data extraction |
publishDate |
2018 |
url |
http://ndltd.ncl.edu.tw/handle/kq9cdn |
work_keys_str_mv |
AT jiaruwu clusteringoftemplatepagefordataextraction AT wújiārú clusteringoftemplatepagefordataextraction AT jiaruwu yàngbǎnwǎngyèjiégòuzìdòngfēnqún AT wújiārú yàngbǎnwǎngyèjiégòuzìdòngfēnqún |
_version_ |
1719297840329523200 |
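
The abstract describes grouping web pages by the similarity of their structure and reporting precision and recall for a target cluster. The record does not state which features or clustering algorithms the thesis uses, so the following is only a minimal sketch under stated assumptions: pages are represented by their sets of root-to-node tag paths, similarity is Jaccard overlap, and clustering is a greedy single pass with an arbitrary threshold of 0.6. All function names and the threshold are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths for one HTML string (assumed feature)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Greedy single-pass clustering: a page joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(paths, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]


def precision_recall(predicted, relevant):
    """Precision and recall of one predicted cluster against the true target set."""
    predicted, relevant = set(predicted), set(relevant)
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    pages = [
        "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
        "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
        "<html><body><table><tr><td>1</td></tr></table></body></html>",
    ]
    print(cluster_pages(pages))
```

In this toy input the first two pages share the same tag paths and the third shares only `html` and `html/body` with them, so it falls below the threshold and starts its own cluster. A real system would likely need richer features and a tuned threshold; this sketch only illustrates the general idea of structure-based grouping and per-cluster evaluation.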