A Domain-Specific Deep Web Query Interface Classifier

碩士 === 淡江大學 === 資訊管理學系碩士班 === 99 === From previous research, the amount of data of the deep web is about 400 to 550 times larger than that of the surface web. In order to retrieve the deep web content residing in databases, we need to find the entrances of the databases, which are the deep web...

Full description

Bibliographic Details
Main Authors: Pei-Tzu Chang, 張珮慈
Other Authors: Chichang Jou
Format: Others
Language:zh-TW
Published: 2011
Online Access:http://ndltd.ncl.edu.tw/handle/87554416458031123645
Description
Summary:碩士 === 淡江大學 === 資訊管理學系碩士班 === 99 === From previous research, the amount of data of the deep web is about 400 to 550 times larger than that of the surface web. In order to retrieve the deep web content residing in databases, we need to find the entrances of the databases, which are the deep web query interfaces. Moreover, since the content of deep web is domain-specific, to identify the deep web query interfaces from various web forms, we propose a two-phase analysis methodology which combines pre-query and post-query analyses, and develop an automatic deep web query interface classification technique. We not only can identify deep web query forms, but also can filter out search engine forms and site search forms, which are to extract static web pages inside a site. Before the classification, we would build feature words for the non-query forms, and would crawl a large scale of domain-specific query forms to extract the semantics of popular fields of that domain. In our classification system, in the pre-query analysis phase, we use feature words for the non-query forms to filter out non-query forms so that processing time at the next phase could be reduced. In the post-query analysis stage, we use the field semantics to fill in values and submit forms automatically, and then classify forms according to the query results of the forms. The experimental result shows our two-phase analysis methodology can obtain high precision. We can filter out not only the search engine forms and site search forms, but also deep web query forms which link to disabled databases.