An Information Extraction Method for HTML Documents and its Applications

Bibliographic Details
Main Authors: Pan, Jia-Yu, 潘家煜
Other Authors: Hsiang Jieh
Format: Others
Language: zh-TW
Published: 1997
Online Access: http://ndltd.ncl.edu.tw/handle/62226478929063142108
Description
Summary: Master's thesis === National Taiwan University === Department of Computer Science and Information Engineering === 85 === The emergence of the World Wide Web has made a huge amount of information available, and extracting useful information from it could be very valuable. However, because of the Web's tremendous size and its lack of a universal document structure, this is not an easy task. In this thesis, we develop a rule-based method for extracting information from an HTML file. A rule specifies what we want to extract from a document. Given such a rule, the extraction program extracts whatever information in the document fits the rule's specification. Given a new document for which no rule describes the information format, the extraction program cannot do anything. We therefore also develop a rule generation method for documents retrieved from web indexes. With slight modification, this method can be extended to documents that have the following properties, where K is a given positive integer: 1. There are K sections within the document, all with the same format. 2. These K sections include the information that we want. 3. Each section includes at least one hyperlink. Based on this information extraction method, we implement two experimental systems to test its usability: an instructed spider and a meta-search engine. An instructed spider can be taught how to traverse and what to collect. A meta-search engine posts keyword queries to several search engines and rearranges the responses in a friendlier format.
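The three document properties listed in the summary (K same-format sections, each containing at least one hyperlink) can be illustrated with a small sketch. This is not the thesis's rule language or its rule generation algorithm, neither of which is described in this record. It is a simplified heuristic that assumes well-nested HTML and treats "same format" as "the hyperlink sits under the same chain of enclosing tags". The names `LinkPathCollector`, `extract_sections`, and `SAMPLE` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class LinkPathCollector(HTMLParser):
    """Record each hyperlink together with the chain of tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.links = []       # one dict per completed <a href=...> element
        self._current = None  # the link being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                path = "/".join(self.stack)
                self._current = {"path": path, "href": href, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None
        # Pop back to the matching open tag (assumes well-nested markup).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def extract_sections(html, k):
    """Return (href, text) pairs for the first tag path holding exactly k links.

    Assumption of this sketch: the k repeated sections are the ones whose
    hyperlinks share an identical enclosing tag path.
    """
    parser = LinkPathCollector()
    parser.feed(html)
    parser.close()
    groups = defaultdict(list)
    for link in parser.links:
        groups[link["path"]].append((link["href"], link["text"]))
    for items in groups.values():
        if len(items) == k:
            return items
    return []


# A made-up result page: one navigation link plus three same-format entries.
SAMPLE = """
<html><body>
<a href="/home">Home</a>
<ul>
<li><a href="http://a.example/1">First</a> description</li>
<li><a href="http://a.example/2">Second</a> description</li>
<li><a href="http://a.example/3">Third</a> description</li>
</ul>
</body></html>
"""

for href, text in extract_sections(SAMPLE, 3):
    print(href, text)
```

With K = 3 the navigation link is ignored because its enclosing tag path occurs only once, while the three list entries share one path. Real search-result pages of the period often omit closing tags, so a practical version would need a more tolerant notion of "same format" than an exact tag-path match.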