Summary: | Master's === Tamkang University === Master's Program, Department of Information Management === 100 === With the rapid growth of the Internet, the volume of content stored in web databases has also increased quickly. This content, hidden behind query interfaces, is called the Deep Web. To obtain the dynamic content that satisfies the conditions imposed by the input parameters, users must enter appropriate parameter values. For this reason, search engines do not collect such content, and users can easily miss important information. Before a system can collect Deep Web content automatically, a system for extracting the schemas of query interfaces must first be established to obtain the mappings between input elements and labels, the data types of legitimate input values, the range constraints on those values, and so on. Only then is it possible to fill in proper values automatically for the elements of a query interface and extract the dynamic content. This work builds a schema extraction system for Deep Web query interfaces. Based on the layout expressions for form extraction proposed by He, we extract the elements, labels, and line breaks of a query interface to produce its Interface Expression (IEXP). In addition, we combine the users' view and the designers' view, and use the ICQ dataset as the foundation for proposing heuristic rules for schema extraction. We solve the problem that visual elements and their corresponding labels are close to each other but mapped incorrectly, without abandoning the principle that an element and its label should not be placed far apart. The proposed layered schema model not only helps extract Deep Web content but also benefits schema matching and schema merging. We evaluate the schema extraction system on the TEL-8 dataset and on query interfaces gathered in past research. The results indicate that the system extracts schemas effectively.
|