Clustering of Template Page for Data Extraction

Master's === National Central University (國立中央大學) === Department of Computer Science and Information Engineering, In-service Master's Program (資訊工程學系在職專班) === 106 === In the field of web data extraction, the diversity of web content and the complexity of web page structure have made automatic data extraction from web pages of different templates a considerable challenge. The web data ex...

Bibliographic Details
Main Authors:	Jia-Ru Wu, 吳佳儒
Other Authors:	Chia-Hui Chang
Format:	Others
Language:	zh-TW
Published:	2018
Online Access:	http://ndltd.ncl.edu.tw/handle/kq9cdn

id	ndltd-TW-106NCU05392074
record_format	oai_dc
spelling	ndltd-TW-106NCU05392074 2019-11-28T05:22:16Z http://ndltd.ncl.edu.tw/handle/kq9cdn Clustering of Template Page for Data Extraction 樣板網頁結構自動分群 Jia-Ru Wu 吳佳儒 Master's National Central University (國立中央大學) Department of Computer Science and Information Engineering, In-service Master's Program (資訊工程學系在職專班) 106 In the field of web data extraction, the diversity of web content and the complexity of web page structure have made automatic data extraction from web pages of different templates a considerable challenge. Web data extraction systems fall mainly into two categories: record level and page level. Both take web pages of the same template as input and use them for data extraction and schema induction. Research on clustering web pages of different templates is rare. This thesis proposes a method that clusters web pages automatically by the similarity of their page structure, which simplifies the problem of extracting data from web pages of different templates. We apply both unsupervised and supervised clustering, based on features we designed, and compare the performance of the two. Although overall clustering performance is not as good as expected, unsupervised clustering reaches a precision of 99% and a recall of approximately 78% for the target cluster, while supervised clustering reaches a precision of 97% and a recall of more than 80%. Finally, with this clustering result, we can generate a complete web page schema and extract POI-related information via the page-level information extraction system UWIDE. The extracted information can also be accumulated in databases to enhance the efficiency and quality of related value-added services. Chia-Hui Chang 張嘉惠 2018 學位論文 ; thesis 42 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Central University (國立中央大學) === Department of Computer Science and Information Engineering, In-service Master's Program (資訊工程學系在職專班) === 106 === In the field of web data extraction, the diversity of web content and the complexity of web page structure have made automatic data extraction from web pages of different templates a considerable challenge. Web data extraction systems fall mainly into two categories: record level and page level. Both take web pages of the same template as input and use them for data extraction and schema induction. Research on clustering web pages of different templates is rare. This thesis proposes a method that clusters web pages automatically by the similarity of their page structure, which simplifies the problem of extracting data from web pages of different templates. We apply both unsupervised and supervised clustering, based on features we designed, and compare the performance of the two. Although overall clustering performance is not as good as expected, unsupervised clustering reaches a precision of 99% and a recall of approximately 78% for the target cluster, while supervised clustering reaches a precision of 97% and a recall of more than 80%. Finally, with this clustering result, we can generate a complete web page schema and extract POI-related information via the page-level information extraction system UWIDE. The extracted information can also be accumulated in databases to enhance the efficiency and quality of related value-added services.
author2	Chia-Hui Chang
author_facet	Chia-Hui Chang Jia-Ru Wu 吳佳儒
author	Jia-Ru Wu 吳佳儒
spellingShingle	Jia-Ru Wu 吳佳儒 Clustering of Template Page for Data Extraction
author_sort	Jia-Ru Wu
title	Clustering of Template Page for Data Extraction
title_short	Clustering of Template Page for Data Extraction
title_full	Clustering of Template Page for Data Extraction
title_fullStr	Clustering of Template Page for Data Extraction
title_full_unstemmed	Clustering of Template Page for Data Extraction
title_sort	clustering of template page for data extraction
publishDate	2018
url	http://ndltd.ncl.edu.tw/handle/kq9cdn
work_keys_str_mv	AT jiaruwu clusteringoftemplatepagefordataextraction AT wújiārú clusteringoftemplatepagefordataextraction AT jiaruwu yàngbǎnwǎngyèjiégòuzìdòngfēnqún AT wújiārú yàngbǎnwǎngyèjiégòuzìdòngfēnqún
_version_	1719297840329523200