Summary: | Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 95 === The amount of information available on the World Wide Web has increased dramatically in recent years; however, many information resources are formatted for human browsing rather than for software programs. Developing a tool that automatically extracts information from semi-structured Web sources, and thereby makes the Web more useful for value-added services, is a demanding task. Such a tool is usually called a wrapper. In this paper, we develop two signal-based methods for implementing a wrapper. The first is called the "histogram and tag name-based correlation coefficient" method. It discovers correlation features between a template marked by the user and a Web page, and uses them to drive the extraction system. In this method, templates for records with different tag structures are generated incrementally by an ART-like algorithm, which follows the basic idea of the ART1 algorithm. Records in a Web page can then be detected efficiently by matching against the generated templates. In the second method, we assign a weight to every tag in a Web page and compute the corresponding area barycenter. After recording all the area barycenters, we find that their distribution helps us recognize the data we want. We then propose an ontology-based method that integrates the information extracted from separately wrapped Web sources by evaluating the similarities between their attributes. We also propose a neural network-based approach for measuring the semantic similarity between words.
Because the WWW is extremely dynamic and continually evolving, the structures of Web documents change frequently, and wrappers may stop working as they did before. In this paper, we therefore propose a filtering approach to implementing an automatic wrapper maintenance mechanism. The basic idea is to use a band-pass filter to automatically locate the contents of interest and then regenerate the record templates in order to construct a new, correct wrapper.
|
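The first method in the summary compares a user-marked template with a Web page through tag histograms and a correlation coefficient. The abstract does not give the exact formulation, so the following is only a minimal sketch under stated assumptions: the histogram counts opening tags by name, and the coefficient is the ordinary Pearson correlation over the union of tag names. The names `TagCollector`, `tag_histogram`, and `tag_correlation` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_histogram(html):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return Counter(parser.tags)


def tag_correlation(hist_a, hist_b):
    """Pearson correlation of two tag histograms (assumed formulation).

    Both histograms are laid out over the union of their tag names.
    Returns 0.0 when either histogram has zero variance.
    """
    names = sorted(set(hist_a) | set(hist_b))
    if not names:
        return 0.0
    x = [hist_a.get(name, 0) for name in names]
    y = [hist_b.get(name, 0) for name in names]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example data: one marked record and two candidate regions.
template = "<tr><td><a>Title</a></td><td>Price</td></tr>"
similar = "<tr><td><a>Other</a></td><td>42</td></tr>"
different = "<div><p>a</p><p>b</p><p>c</p></div>"

score_similar = tag_correlation(tag_histogram(template), tag_histogram(similar))
score_different = tag_correlation(tag_histogram(template), tag_histogram(different))
```

In this sketch, a candidate region whose tag structure matches the template scores close to 1, while a structurally unrelated region scores much lower. A threshold on the score could then decide whether a region counts as a record; the threshold value, and how it interacts with the ART-like template generation, are not specified in the summary.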