Page-level Information Extraction System

Master's === National Central University === Graduate Institute of Software Engineering === 103 === The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is limited to record-level data extraction. In addition, the demand for extracting data from large numbers of web...

Bibliographic Details
Main Authors:	Jhong-li Ding, 丁中立
Other Authors:	Chia-hui Chang
Format:	Others
Language:	zh-TW
Published:	2015
Online Access:	http://ndltd.ncl.edu.tw/handle/80764607998088201775

id	ndltd-TW-103NCU05392081
record_format	oai_dc
spelling	ndltd-TW-103NCU05392081 2016-08-17T04:23:14Z http://ndltd.ncl.edu.tw/handle/80764607998088201775 Page-level Information Extraction System 網頁層級資料擷取系統 Jhong-li Ding 丁中立 Master's National Central University Graduate Institute of Software Engineering 103 The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is limited to record-level data extraction. In addition, the demand for extracting data from large numbers of web pages makes it a challenging task for researchers. Although the web data extracted by a page-level approach is more complete than that extracted by a record-level approach, very little research focuses on this task because of the difficulty and complexity of the problem. On the other hand, existing web data extraction systems require users with an IT background, because these systems do not provide a user-friendly GUI. In this paper, we present a web data extraction system based on the work of M.-C. Chen and T.-S. Chen. We provide a user-friendly GUI to improve the training procedure of the schema induction process. The experimental results show that performance on list-page websites remains high, while on detail pages precision increases by 33.08% and recall by 32.4%. In addition, the improved system achieves higher recall than the other systems. In terms of accuracy, our system outperforms TEX with the default threshold. If we adjust the thresholds of the models, overall accuracy improves from 94.5% to 98.8%; overall accuracy is 27% higher than that of TEX. Chia-hui Chang 張嘉惠 2015 thesis 50 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Central University === Graduate Institute of Software Engineering === 103 === The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is limited to record-level data extraction. In addition, the demand for extracting data from large numbers of web pages makes it a challenging task for researchers. Although the web data extracted by a page-level approach is more complete than that extracted by a record-level approach, very little research focuses on this task because of the difficulty and complexity of the problem. On the other hand, existing web data extraction systems require users with an IT background, because these systems do not provide a user-friendly GUI. In this paper, we present a web data extraction system based on the work of M.-C. Chen and T.-S. Chen. We provide a user-friendly GUI to improve the training procedure of the schema induction process. The experimental results show that performance on list-page websites remains high, while on detail pages precision increases by 33.08% and recall by 32.4%. In addition, the improved system achieves higher recall than the other systems. In terms of accuracy, our system outperforms TEX with the default threshold. If we adjust the thresholds of the models, overall accuracy improves from 94.5% to 98.8%; overall accuracy is 27% higher than that of TEX.
author2	Chia-hui Chang
author_facet	Chia-hui Chang Jhong-li Ding 丁中立
author	Jhong-li Ding 丁中立
spellingShingle	Jhong-li Ding 丁中立 Page-level Information Extraction System
author_sort	Jhong-li Ding
title	Page-level Information Extraction System
title_short	Page-level Information Extraction System
title_full	Page-level Information Extraction System
title_fullStr	Page-level Information Extraction System
title_full_unstemmed	Page-level Information Extraction System
title_sort	page-level information extraction system
publishDate	2015
url	http://ndltd.ncl.edu.tw/handle/80764607998088201775
work_keys_str_mv	AT jhongliding pagelevelinformationextractionsystem AT dīngzhōnglì pagelevelinformationextractionsystem AT jhongliding wǎngyècéngjízīliàoxiéqǔxìtǒng AT dīngzhōnglì wǎngyècéngjízīliàoxiéqǔxìtǒng
_version_	1718377188653268992

Similar Items