A Machine Learning Based Approach to WebExtraction from Template Pages

Master's thesis === National Central University === Department of Computer Science and Information Engineering, In-service Master's Program === 98 === A huge amount of information on the World Wide Web is presented in structured HTML pages that are generated dynamically from databases and share the same template. This paper proposes a page-level web data extraction system, FiVaTech2, that extracts schema and template...

Bibliographic Details
Main Authors:	Chih-Hao Chang, 張志豪
Other Authors:	Chia-Hui Chang
Format:	Others
Language:	en_US
Published:	2010
Online Access:	http://ndltd.ncl.edu.tw/handle/35548787181124476380

id	ndltd-TW-098NCU05392091
record_format	oai_dc
spelling	ndltd-TW-098NCU05392091 2016-04-20T04:18:01Z http://ndltd.ncl.edu.tw/handle/35548787181124476380 A Machine Learning Based Approach to WebExtraction from Template Pages 機器學習應用於樣版網頁擷取之研究 Chih-Hao Chang 張志豪 Master's thesis, National Central University, Department of Computer Science and Information Engineering, In-service Master's Program, 98 Chia-Hui Chang 張嘉惠 2010 thesis 33 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Master's thesis === National Central University === Department of Computer Science and Information Engineering, In-service Master's Program === 98 === A huge amount of information on the World Wide Web is presented in structured HTML pages that are generated dynamically from databases and share the same template. This paper proposes a page-level web data extraction system, FiVaTech2, that extracts schema and templates from these template-based web pages automatically. FiVaTech2 extends our previous page-level web data extraction system, FiVaTech. FiVaTech2 uses a machine learning (ML) based method that compares pairs of HTML tags to estimate how likely they are to appear in the web pages. We use the J48 decision tree classifier, an ML technique, together with image comparison to assist template detection. Each HTML tag in a web page has several features, which can be divided into three types: visual information, DOM tree information, and HTML tag contents. Our experiments show encouraging results on the test pages when the three types of tag features are used in combination. The experiments also show that FiVaTech2 performs better and is more efficient than FiVaTech.
author2	Chia-Hui Chang
author_facet	Chia-Hui Chang Chih-Hao Chang 張志豪
author	Chih-Hao Chang 張志豪
title	A Machine Learning Based Approach to WebExtraction from Template Pages
publishDate	2010
url	http://ndltd.ncl.edu.tw/handle/35548787181124476380
