Clustering of Template Page for Data Extraction

Master's === National Central University === Department of Computer Science and Information Engineering, In-service Master's Program === 106 === In the field of web data extraction, the diversity of web content and the complexity of web page structure have long made it challenging to extract data automatically from web pages built on different templates. The web data ex...


Bibliographic Details
Main Authors: Jia-Ru Wu, 吳佳儒
Other Authors: Chia-Hui Chang
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/kq9cdn
id ndltd-TW-106NCU05392074
record_format oai_dc
spelling ndltd-TW-106NCU05392074 2019-11-28T05:22:16Z http://ndltd.ncl.edu.tw/handle/kq9cdn Clustering of Template Page for Data Extraction 樣板網頁結構自動分群 (Automatic Clustering of Template Web Page Structures) Jia-Ru Wu 吳佳儒 Master's, National Central University, Department of Computer Science and Information Engineering, In-service Master's Program, 106 Chia-Hui Chang 張嘉惠 2018 thesis 42 zh-TW
collection NDLTD
description Master's === National Central University === Department of Computer Science and Information Engineering, In-service Master's Program === 106 === In the field of web data extraction, the diversity of web content and the complexity of web page structure have long made it challenging to extract data automatically from web pages built on different templates. Web data extraction systems fall mainly into two categories: record-level and page-level. Both take web pages of the same template as input and use them for data extraction and schema induction. Research on clustering web pages of different templates is rare. This work proposes a method that clusters web pages automatically by the similarity of their structure, which simplifies the problem of extracting data from pages of different templates. We apply both unsupervised and supervised clustering, based on our designed features, and compare the performance of the two. Although the overall clustering performance is not as good as expected, unsupervised clustering reaches a precision of 99% and a recall of approximately 78% for the target cluster, while supervised clustering reaches a precision of 97% and a recall of more than 80%. Finally, with this clustering result, we can generate a complete web page schema and extract POI-related information via a page-level information extraction system (UWIDE). The extracted information can also be accumulated in databases to enhance the efficiency and quality of related value-added services.