Visual Tree Evaluation on Block Extraction

Master's === National Cheng Kung University === Department of Computer Science and Information Engineering (Master's and Doctoral Program) === 96 === More and more people use Cascading Style Sheets (CSS) to manage their Web pages, because CSS makes typesetting easy and convenient. However, CSS causes Web pages to be displayed with an ambiguous structure. The data extraction systems based on Web page structure matc...
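
The full description in this record notes that a block with diverse content has relatively high entropy. As a rough illustration only, the sketch below computes Shannon entropy over the tokens of a block. The function name `block_entropy` and the sample token lists are assumptions made for this example; the abstract does not define the entropy attributes used by the thesis's Entropy Evaluation Model, so this should not be read as the author's method.

```python
import math
from collections import Counter


def block_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution in one block.

    Hypothetical helper: the thesis abstract does not specify how its
    entropy attributes are computed.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # Sum of p * log2(1 / p) over all distinct tokens, with p = c / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Made-up sample blocks: one repetitive, one with diverse content.
repetitive_block = ["home", "home", "home", "home"]
diverse_block = ["css", "entropy", "visual", "tree"]

print(block_entropy(repetitive_block))
print(block_entropy(diverse_block))
```

Under this simple measure, the block with diverse tokens scores higher than the repetitive one, which matches the intuition stated in the abstract.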


Bibliographic Details
Main Authors:	Wei-Ting Cho, 卓威廷
Other Authors:	Hung-Yu Kao
Format:	Others
Language:	en_US
Published:	2008
Online Access:	http://ndltd.ncl.edu.tw/handle/04206107953791017472

id	ndltd-TW-096NCKU5392088
record_format	oai_dc
spelling	ndltd-TW-096NCKU5392088 2015-11-23T04:03:11Z http://ndltd.ncl.edu.tw/handle/04206107953791017472 Visual Tree Evaluation on Block Extraction 基於視覺樹評估之區塊擷取 Wei-Ting Cho 卓威廷 Master's, National Cheng Kung University, Department of Computer Science and Information Engineering (Master's and Doctoral Program), 96 Hung-Yu Kao 高宏宇 2008 thesis 52 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Master's === National Cheng Kung University === Department of Computer Science and Information Engineering (Master's and Doctoral Program) === 96 === More and more people use Cascading Style Sheets (CSS) to manage their Web pages, because CSS makes typesetting easy and convenient. However, CSS causes Web pages to be displayed with an ambiguous structure. Data extraction systems based on Web page structure matching may therefore make more mistaken judgments. Furthermore, they only identify blocks with similar structures. Some systems use specific HTML tags, such as TABLE, TR, TD and P, to partition Web pages, but in CSS Web pages these tags are generally less common than DIV tags. In this paper, to address these limitations, we present a system that applies the properties of CSS Web pages to extract data blocks. The system comprises three modules: Visual Tree Generation (VTG), Entropy Evaluation Model (EEM) and Block Identification (BI). Web pages are first converted into tree objects in the VTG module. The module transforms DOM trees into visual trees by using the visual information and HTML tag names of nodes to modify the tree structure. The proposed visual tree represents the arrangement of data as displayed in a Web browser, which matches the visual intention for evaluating informative blocks. If a block consists of diverse content, the block entropy will be relatively high. Entropy is therefore a suitable measure of the information content of blocks for distinguishing presentation blocks from others. In the EEM module, the entropy attributes of each node in a visual tree are calculated. The BI module, which comprises block marking and block refining, uses these attributes to identify block types. The experimental results show that the node attributes and the visual tree are useful for extracting blocks from CSS Web pages. Our system also outperforms other systems on container block extraction.
author2	Hung-Yu Kao
author_facet	Hung-Yu Kao Wei-Ting Cho 卓威廷
author	Wei-Ting Cho 卓威廷
spellingShingle	Wei-Ting Cho 卓威廷 Visual Tree Evaluation on Block Extraction
author_sort	Wei-Ting Cho
title	Visual Tree Evaluation on Block Extraction
title_short	Visual Tree Evaluation on Block Extraction
title_full	Visual Tree Evaluation on Block Extraction
title_fullStr	Visual Tree Evaluation on Block Extraction
title_full_unstemmed	Visual Tree Evaluation on Block Extraction
title_sort	visual tree evaluation on block extraction
publishDate	2008
url	http://ndltd.ncl.edu.tw/handle/04206107953791017472
work_keys_str_mv	AT weitingcho visualtreeevaluationonblockextraction AT zhuōwēitíng visualtreeevaluationonblockextraction AT weitingcho jīyúshìjuéshùpínggūzhīqūkuàixiéqǔ AT zhuōwēitíng jīyúshìjuéshùpínggūzhīqūkuàixiéqǔ
_version_	1718133997577437184
