Page-level Information Extraction System
Master's === National Central University === Graduate Institute of Software Engineering === 103 === The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is largely limited to record-level data extraction. In addition, the demand for extracting data from large numbers of web...
Main Authors: | Jhong-li Ding, 丁中立 |
---|---|
Other Authors: | Chia-hui Chang |
Format: | Others |
Language: | zh-TW |
Published: | 2015 |
Online Access: | http://ndltd.ncl.edu.tw/handle/80764607998088201775 |
id |
ndltd-TW-103NCU05392081 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-103NCU05392081 2016-08-17T04:23:14Z http://ndltd.ncl.edu.tw/handle/80764607998088201775 Page-level Information Extraction System 網頁層級資料擷取系統 Jhong-li Ding 丁中立 Master's, National Central University, Graduate Institute of Software Engineering, 103 Chia-hui Chang 張嘉惠 2015 thesis 50 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Central University === Graduate Institute of Software Engineering === 103 === The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, existing research is largely limited to record-level data extraction. In addition, the demand for extracting data from large numbers of web pages makes it a challenging task for researchers.
Although the data extracted by a page-level approach is more complete than that of a record-level approach, very little research focuses on this task because of its difficulty and complexity. On the other hand, existing web data extraction systems require users with an IT background, because they do not provide a user-friendly GUI.
In this paper, we present a web data extraction system based on the work of M.-C. Chen and T.-S. Chen. We provide a user-friendly GUI to improve the training procedure of the schema induction process. The experimental results show that performance on list-page websites remains high, while on detail pages precision and recall increase by 33.08% and 32.4%, respectively. In addition, the improved system achieves the highest recall among the systems compared. In terms of accuracy, our system outperforms TEX with the default threshold. If we adjust the thresholds of the models, overall accuracy improves from 94.5% to 98.8%. Overall accuracy is 27% higher than that of TEX.
|
author2 |
Chia-hui Chang |
author_facet |
Chia-hui Chang Jhong-li Ding 丁中立 |
author |
Jhong-li Ding 丁中立 |
title |
Page-level Information Extraction System |
publishDate |
2015 |
url |
http://ndltd.ncl.edu.tw/handle/80764607998088201775 |