Summary: | Master's === National Taiwan University === Department of Computer Science and Information Engineering === 85 === The emergence of the World Wide Web has made a huge amount of information
available, and extracting the useful parts of it could be very valuable.
However, because of the Web's tremendous size and its lack of a universal
document structure, this is not an easy task.
In this thesis, we develop a rule-based method to extract information from
an HTML file. A rule specifies what we want to extract from a document.
Given such a rule, the information extraction program extracts whatever
information in the document fits the specification of the rule.
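
As an illustration only, the idea of a rule that specifies what to extract can be sketched as follows. The abstract does not give the rule format, so the `Rule` class (a pair of start and end delimiters) and the `extract` function below are hypothetical stand-ins, not the format used in the thesis.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule: the text between `start` and `end` is extracted."""
    name: str
    start: str
    end: str


def extract(html: str, rule: Rule) -> list[str]:
    """Return every piece of text in `html` that fits the rule.

    An empty list means the document holds nothing that fits the rule.
    """
    # Escape the delimiters so they are matched literally, not as regex syntax.
    pattern = re.escape(rule.start) + r"(.*?)" + re.escape(rule.end)
    return [match.strip() for match in re.findall(pattern, html, flags=re.DOTALL)]


page = "<ul><li><b>Alpha</b></li><li><b>Beta</b></li></ul>"
title_rule = Rule(name="title", start="<b>", end="</b>")

print(extract(page, title_rule))
print(extract("<p>no match</p>", title_rule))
```

With this sketch, the same rule can be applied to any document; it simply returns nothing when the document contains no text that fits.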
If no rule describes the information format of a new document, however,
the extraction program cannot do anything. We therefore also develop a
rule generation method for documents retrieved from web
indexes. With slight modification, our method can be extended to documents
that have the following properties:
1. There are K sections within the document, all with the same format.
2. These K sections include the information that we want.
3. Each section includes at least one hyperlink.
Here K is a given positive integer.
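
The properties above can be checked mechanically, at least in part. The following is a minimal sketch under stated assumptions: the document has already been split into sections (how the thesis finds sections is not described here), "same format" is approximated as "same sequence of opening tags", and the names `TagCollector`, `describe`, and `fits_properties` are hypothetical. Property 2 depends on what the user wants and is not checked.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the opening tags and the hyperlink targets of one section."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def describe(section):
    """Return (tag signature, hyperlinks) for one HTML section."""
    collector = TagCollector()
    collector.feed(section)
    collector.close()
    return tuple(collector.tags), collector.links


def fits_properties(sections, k):
    """Check properties 1 and 3 for a document already split into sections."""
    if len(sections) != k:
        return False
    described = [describe(section) for section in sections]
    # Assumption: two sections share a format when their tag sequences match.
    same_format = len({signature for signature, _ in described}) == 1
    every_section_has_link = all(links for _, links in described)
    return same_format and every_section_has_link


sections = [
    '<li><a href="http://a.example/">A</a> <i>first</i></li>',
    '<li><a href="http://b.example/">B</a> <i>second</i></li>',
]
print(fits_properties(sections, 2))
```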
Based on this information extraction method, we implement two experimental
systems to test its usability: an instructed spider and a
meta-search engine. An instructed spider can be taught how to traverse and
what to collect. A meta-search engine posts keyword queries to several
search engines and rearranges the responses in a friendlier format.
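
One possible way to rearrange responses from several engines is sketched below. The abstract does not say how the responses are rearranged, so the round-robin interleaving with duplicate removal, the `merge_results` function, and the sample data are all assumptions made for illustration; no real search engine is contacted.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked result lists, keeping the first copy of each URL.

    Each result list holds (title, url) pairs in the order an engine
    returned them. Results are taken rank by rank across engines.
    """
    seen = set()
    merged = []
    for rank_group in zip_longest(*result_lists):
        for hit in rank_group:
            if hit is None:  # this engine returned fewer results
                continue
            title, url = hit
            if url not in seen:
                seen.add(url)
                merged.append((title, url))
    return merged


# Hypothetical responses, as if already extracted from two engines' pages.
engine_a = [("A1", "http://x.example/1"), ("A2", "http://x.example/2")]
engine_b = [
    ("B1", "http://x.example/2"),
    ("B2", "http://x.example/3"),
    ("B3", "http://x.example/4"),
]

for title, url in merge_results([engine_a, engine_b]):
    print(title, url)
```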
|