Page-level Information Extraction System

Master's === National Central University (國立中央大學) === Graduate Institute of Software Engineering === 103 === The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is limited to record-level data extraction. In addition, the demand for extracting data from large numbers of web...


Bibliographic Details
Main Authors: Jhong-li Ding, 丁中立
Other Authors: Chia-hui Chang
Format: Others
Language: zh-TW
Published: 2015
Online Access: http://ndltd.ncl.edu.tw/handle/80764607998088201775
id ndltd-TW-103NCU05392081
record_format oai_dc
spelling ndltd-TW-103NCU05392081 2016-08-17T04:23:14Z http://ndltd.ncl.edu.tw/handle/80764607998088201775 Page-level Information Extraction System 網頁層級資料擷取系統 Jhong-li Ding 丁中立 Master's National Central University (國立中央大學) Graduate Institute of Software Engineering 103 Chia-hui Chang 張嘉惠 2015 thesis 50 zh-TW
collection NDLTD
sources NDLTD
description Master's === National Central University (國立中央大學) === Graduate Institute of Software Engineering === 103 === The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is limited to record-level data extraction. In addition, the demand for extracting data from large numbers of web pages makes this a challenging task for researchers. Although the data extracted by a page-level approach is more complete than that of a record-level approach, very little research focuses on this task because of its difficulty and complexity. On the other hand, existing web data extraction systems require users with an IT background, because they do not provide a friendly GUI. In this paper, we present a web data extraction system based on the work of M.-C. Chen and T.-S. Chen. We provide a friendly GUI that improves the training procedure of the schema induction process. The experimental results show that performance on list-page websites remains high, while on detail pages precision increases by 33.08% and recall by 32.4%. In addition, the improved system achieves higher recall than the other systems. In terms of accuracy, our system outperforms TEX with the default threshold. By adjusting the model thresholds, we can improve the overall accuracy from 94.5% to 98.8%. Overall accuracy is 27% higher than that of TEX.