Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach
To surface the Deep Web, one crucial task is to predict whether a given web page has a search interface (searchable HyperText Markup Language (HTML) form) or not. Previous studies have focused on supervised classification with labeled examples. However, labeled data are scarce, hard to get and requi...
Main Authors: | Hong Wang, Qingsong Xu, Lifeng Zhou |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2014-12-01 |
Series: | Information |
Subjects: | semi-supervised learning, Deep Web mining, search interface identification, ensemble learning |
Online Access: | http://www.mdpi.com/2078-2489/5/4/634 |
id |
doaj-aef6e4cf9235458e93e523c84db8699d |
---|---|
record_format |
Article |
spelling |
doaj-aef6e4cf9235458e93e523c84db8699d; 2020-11-24T23:21:12Z; eng; MDPI AG; Information; 2078-2489; 2014-12-01; 5; 4; 634; 651; 10.3390/info5040634; info5040634; Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach; Hong Wang [0]; Qingsong Xu [1]; Lifeng Zhou [2]; School of Mathematics & Statistics, Central South University, Changsha 410075, China; School of Mathematics & Statistics, Central South University, Changsha 410075, China; School of Mathematics & Statistics, Central South University, Changsha 410075, China; To surface the Deep Web, one crucial task is to predict whether a given web page has a search interface (searchable HyperText Markup Language (HTML) form) or not. Previous studies have focused on supervised classification with labeled examples. However, labeled data are scarce, hard to get and require tedious manual work, while unlabeled HTML forms are abundant and easy to obtain. In this research, we consider the plausibility of using both labeled and unlabeled data to train better models to identify search interfaces more effectively. We present a semi-supervised co-training ensemble learning approach using both neural networks and decision trees to deal with the search interface identification problem. We show that the proposed model outperforms previous methods using only labeled data. We also show that adding unlabeled data improves the effectiveness of the proposed model.; http://www.mdpi.com/2078-2489/5/4/634; semi-supervised learning; Deep Web mining; search interface identification; ensemble learning |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Hong Wang Qingsong Xu Lifeng Zhou |
spellingShingle |
Hong Wang Qingsong Xu Lifeng Zhou Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach Information semi-supervised learning Deep Web mining search interface identification ensemble learning |
author_facet |
Hong Wang Qingsong Xu Lifeng Zhou |
author_sort |
Hong Wang |
title |
Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach |
title_sort |
deep web search interface identification: a semi-supervised ensemble approach |
publisher |
MDPI AG |
series |
Information |
issn |
2078-2489 |
publishDate |
2014-12-01 |
description |
To surface the Deep Web, one crucial task is to predict whether a given web page has a search interface (searchable HyperText Markup Language (HTML) form) or not. Previous studies have focused on supervised classification with labeled examples. However, labeled data are scarce, hard to get and require tedious manual work, while unlabeled HTML forms are abundant and easy to obtain. In this research, we consider the plausibility of using both labeled and unlabeled data to train better models to identify search interfaces more effectively. We present a semi-supervised co-training ensemble learning approach using both neural networks and decision trees to deal with the search interface identification problem. We show that the proposed model outperforms previous methods using only labeled data. We also show that adding unlabeled data improves the effectiveness of the proposed model. |
topic |
semi-supervised learning Deep Web mining search interface identification ensemble learning |
url |
http://www.mdpi.com/2078-2489/5/4/634 |
work_keys_str_mv |
AT hongwang deepwebsearchinterfaceidentificationasemisupervisedensembleapproach AT qingsongxu deepwebsearchinterfaceidentificationasemisupervisedensembleapproach AT lifengzhou deepwebsearchinterfaceidentificationasemisupervisedensembleapproach |
_version_ |
1725572265639149568 |