Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach

To surface the Deep Web, one crucial task is to predict whether a given web page has a search interface (searchable HyperText Markup Language (HTML) form) or not. Previous studies have focused on supervised classification with labeled examples. However, labeled data are scarce, hard to get and requi...


Bibliographic Details
Main Authors: Hong Wang, Qingsong Xu, Lifeng Zhou
Format: Article
Language: English
Published: MDPI AG 2014-12-01
Series: Information
Subjects: semi-supervised learning; Deep Web mining; search interface identification; ensemble learning
Online Access: http://www.mdpi.com/2078-2489/5/4/634
id doaj-aef6e4cf9235458e93e523c84db8699d
record_format Article
spelling doaj-aef6e4cf9235458e93e523c84db8699d
2020-11-24T23:21:12Z
eng
MDPI AG
Information, ISSN 2078-2489, 2014-12-01, vol. 5, no. 4, pp. 634-651
doi 10.3390/info5040634
Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach
Hong Wang, School of Mathematics & Statistics, Central South University, Changsha 410075, China
Qingsong Xu, School of Mathematics & Statistics, Central South University, Changsha 410075, China
Lifeng Zhou, School of Mathematics & Statistics, Central South University, Changsha 410075, China
http://www.mdpi.com/2078-2489/5/4/634
semi-supervised learning
Deep Web mining
search interface identification
ensemble learning
collection DOAJ
language English
format Article
sources DOAJ
author Hong Wang
Qingsong Xu
Lifeng Zhou
author_facet Hong Wang
Qingsong Xu
Lifeng Zhou
author_sort Hong Wang
title Deep Web Search Interface Identification: A Semi-Supervised Ensemble Approach
title_sort deep web search interface identification: a semi-supervised ensemble approach
publisher MDPI AG
series Information
issn 2078-2489
publishDate 2014-12-01
description To surface the Deep Web, one crucial task is to predict whether a given web page has a search interface (a searchable HyperText Markup Language (HTML) form) or not. Previous studies have focused on supervised classification with labeled examples. However, labeled data are scarce, hard to obtain, and require tedious manual work, while unlabeled HTML forms are abundant and easy to obtain. In this research, we consider the plausibility of using both labeled and unlabeled data to train better models that identify search interfaces more effectively. We present a semi-supervised co-training ensemble learning approach that uses both neural networks and decision trees to address the search interface identification problem. We show that the proposed model outperforms previous methods that use only labeled data. We also show that adding unlabeled data improves the effectiveness of the proposed model.
topic semi-supervised learning
Deep Web mining
search interface identification
ensemble learning
url http://www.mdpi.com/2078-2489/5/4/634