A Novel On-Line Learning Wrapper System and Its Automatic Maintenance Mechanism

Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 95 === The amount of information available on the World Wide Web has increased dramatically in recent years; however, many information resources are formatted for human browsing rather than for software programs. It is a demanding task to develop a tool to automatically...


Bibliographic Details
Main Authors: Shao-Jui Wang, 王紹睿
Other Authors: 蘇木春
Format: Others
Language: zh-TW
Published: 2007
Online Access: http://ndltd.ncl.edu.tw/handle/81458469299789103752
id ndltd-TW-095NCU05392042
record_format oai_dc
spelling ndltd-TW-095NCU05392042 2015-10-13T13:59:55Z http://ndltd.ncl.edu.tw/handle/81458469299789103752 A Novel On-Line Learning Wrapper System and Its Automatic Maintenance Mechanism 具線上學習之擷取系統和其自動維護機制 Shao-Jui Wang 王紹睿 Master's National Central University Graduate Institute of Computer Science and Information Engineering 95 蘇木春 2007 thesis 133 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 95 === The amount of information available on the World Wide Web has increased dramatically in recent years; however, many information resources are formatted for human browsing rather than for software programs. Developing a tool that automatically extracts information from semi-structured Web information sources, and so increases the utility of the Web for value-added services, is a demanding task. Such a tool is usually called a wrapper. In this paper, we develop two signal-based methods for implementing a wrapper. The first is called the "histogram and tag name-based correlation coefficient" method. It discovers correlation features between a user-marked template and a webpage and uses them to implement the extraction system. Templates for records with different tag structures are generated incrementally by an ART-like algorithm, which follows the basic idea of the ART1 algorithm. Records in a Web page can then be detected efficiently by matching against the generated templates. The second method treats every tag in a webpage as having a weight, from which an area barycenter can be computed. After all the area barycenters are recorded, their distribution helps identify the data of interest. We then propose an ontology-based method that integrates the information extracted from separately wrapped Web sources by evaluating the similarities between their attributes. We also propose a neural network-based approach for measuring semantic similarity between words. Because the WWW is extremely dynamic and continually evolving, the structures of Web documents change frequently, and wrappers may stop working as they did before. We therefore propose a filtering approach to implementing an automatic wrapper maintenance mechanism. The basic idea of the proposed method is to use a band-pass filter to automatically locate the contents of interest and then regenerate new record templates in order to construct a new and correct wrapper.
author2 蘇木春
author_facet 蘇木春
Shao-Jui Wang
王紹睿
author Shao-Jui Wang
王紹睿
author_sort Shao-Jui Wang
title A Novel On-Line Learning Wrapper System and Its Automatic Maintenance Mechanism
publishDate 2007
url http://ndltd.ncl.edu.tw/handle/81458469299789103752