Summary: | 碩士 === 國立交通大學 === 資訊學院碩士在職專班資訊組 === 95 === According to the statistical data, there are more than 14 billion web pages in whole world by June of 2007. It’s a important thing that how to use this huge database efficiently. For the information that we do not know its location, we usually use search engines to help us to find it out. And for the information that we do know where it is, we use data extraction to increase the efficiency.
BODE (Browser Oriented Data Extraction), developed by our laboratory, is such a web data extraction system. Its GUI can be used to indicate the data they want to retrieve, and the system will generate the BODE script that is used in the extraction process, and then start to extract.
However, people must have the basic knowledge about the syntax of BODE script, XPath and HTML Tag to build the BODE script. To reduce the threshold of using BODE system, this
thesis proposes an algorithm to distinguish the useful information blocks from a single web site, so as to accomplish the goal of automatically generating BODE script.
|