A Domain-Specific Deep Web Query Interface Classifier

Master's === Tamkang University === Master's Program, Department of Information Management === 99 === Previous research estimates that the deep web holds about 400 to 550 times as much data as the surface web. To retrieve deep web content residing in databases, we must first find the entrances to those databases, which are the deep web...

Bibliographic Details
Main Authors: Pei-Tzu Chang, 張珮慈
Other Authors: Chichang Jou
Format: Others
Language: zh-TW
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/87554416458031123645
id ndltd-TW-099TKU05396001
record_format oai_dc
spelling ndltd-TW-099TKU05396001 2015-10-30T04:05:41Z http://ndltd.ncl.edu.tw/handle/87554416458031123645 A Domain-Specific Deep Web Query Interface Classifier 一個識別特定主題深網查詢介面的分類器 Pei-Tzu Chang 張珮慈 Master's Tamkang University Master's Program, Department of Information Management 99 Chichang Jou 周清江 2011 thesis 80 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === Tamkang University === Master's Program, Department of Information Management === 99 === Previous research estimates that the deep web holds about 400 to 550 times as much data as the surface web. To retrieve deep web content residing in databases, we must first find the entrances to those databases, namely the deep web query interfaces. Because deep web content is domain-specific, we propose a two-phase analysis methodology that combines pre-query and post-query analyses to identify deep web query interfaces among various web forms, and we develop an automatic deep web query interface classification technique based on it. The technique not only identifies deep web query forms but also filters out search engine forms and site search forms, which retrieve static web pages within a site. Before classification, we build a set of feature words for non-query forms and crawl a large number of domain-specific query forms to extract the semantics of the popular fields in that domain. In the pre-query analysis phase, the classification system uses the feature words to filter out non-query forms, which reduces processing time in the next phase. In the post-query analysis phase, it uses the field semantics to fill in values and submit forms automatically, and then classifies each form according to its query results. The experimental results show that the two-phase analysis methodology achieves high precision. It filters out not only search engine forms and site search forms but also deep web query forms that link to disabled databases.
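The two-phase flow described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the feature words, the field semantics for a hypothetical "books" domain, the `Form` type, the injected `submit` callable, and the rule used to judge the result page are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed feature words that mark non-query forms (login, registration, ...).
NON_QUERY_FEATURE_WORDS = {"login", "password", "register", "subscribe"}

# Assumed field semantics for a hypothetical "books" domain:
# popular field concept -> sample value to fill in.
FIELD_SEMANTICS = {"title": "python", "author": "smith"}


@dataclass
class Form:
    url: str
    text: str          # visible text and labels of the form
    fields: List[str]  # names of the input fields


def is_non_query_form(form: Form) -> bool:
    """Pre-query phase: flag forms whose text contains a non-query feature word."""
    words = set(form.text.lower().split())
    return bool(words & NON_QUERY_FEATURE_WORDS)


def fill_values(form: Form) -> Dict[str, str]:
    """Pick a sample value for every field whose name matches a known concept."""
    return {
        name: FIELD_SEMANTICS[name.lower()]
        for name in form.fields
        if name.lower() in FIELD_SEMANTICS
    }


def looks_like_database_results(page: str, values: Dict[str, str]) -> bool:
    """Assumed rule: the result page echoes at least one submitted value."""
    page = page.lower()
    return any(value.lower() in page for value in values.values())


def classify(form: Form, submit: Callable[[Form, Dict[str, str]], str]) -> str:
    """Run the pre-query filter, then submit the form and judge its results."""
    if is_non_query_form(form):
        return "non-query form"  # filtered out before any submission
    values = fill_values(form)
    if not values:
        return "other form"  # no domain field recognized, nothing to submit
    # Post-query phase: `submit` is supplied by the caller (e.g. an HTTP client).
    result_page = submit(form, values)
    if looks_like_database_results(result_page, values):
        return "deep web query form"
    return "other form"  # e.g. empty results from a disabled database
```

Passing `submit` as a parameter keeps the sketch free of network code, so the same logic can be exercised with a stub that returns canned result pages.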
author2 Chichang Jou
author_facet Chichang Jou
Pei-Tzu Chang
張珮慈
author Pei-Tzu Chang
張珮慈
title A Domain-Specific Deep Web Query Interface Classifier
publishDate 2011
url http://ndltd.ncl.edu.tw/handle/87554416458031123645