Visual Tree Evaluation on Block Extraction

Master's thesis === National Cheng Kung University (國立成功大學) === Department of Computer Science and Information Engineering, Master's and Doctoral Program === 96 === More and more people use Cascading Style Sheets (CSS) to manage their Web pages because CSS makes typesetting easy and convenient. However, CSS gives Web pages an ambiguous structure. The data extraction systems based on Web page structure matc...

Bibliographic Details
Main Author: Wei-Ting Cho (卓威廷)
Other Authors: Hung-Yu Kao
Format: Others
Language: en_US
Published: 2008
Online Access: http://ndltd.ncl.edu.tw/handle/04206107953791017472
id ndltd-TW-096NCKU5392088
record_format oai_dc
spelling ndltd-TW-096NCKU5392088 2015-11-23T04:03:11Z http://ndltd.ncl.edu.tw/handle/04206107953791017472 Visual Tree Evaluation on Block Extraction 基於視覺樹評估之區塊擷取 Wei-Ting Cho 卓威廷 Master's thesis, National Cheng Kung University (國立成功大學), Department of Computer Science and Information Engineering, Master's and Doctoral Program, 96
Hung-Yu Kao 高宏宇 2008 thesis (學位論文) 52 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's thesis === National Cheng Kung University (國立成功大學) === Department of Computer Science and Information Engineering, Master's and Doctoral Program === 96 === More and more people use Cascading Style Sheets (CSS) to manage their Web pages because CSS makes typesetting easy and convenient. However, CSS gives Web pages an ambiguous structure. Data extraction systems based on Web page structure matching may therefore make more mistaken judgments. Furthermore, such systems only identify blocks with similar structures. Some systems partition Web pages using specific HTML tags such as TABLE, TR, TD and P, but in CSS Web pages these tags are generally less common than DIV tags. To address these limitations, this work presents a system that uses the properties of CSS Web pages to extract data blocks. The system comprises three modules: Visual Tree Generation (VTG), Entropy Evaluation Model (EEM) and Block Identification (BI). In the VTG module, Web pages are first converted into tree objects: the module transforms DOM trees into visual trees, using the visual information and HTML tag names of nodes to modify the tree structure. The resulting visual tree reflects the arrangement of data as displayed in a Web browser, which matches the visual intent needed for evaluating informative blocks. If a block consists of diverse content, its entropy will be relatively high; entropy is therefore a suitable measure of the information content of blocks for distinguishing presentation blocks from others. In the EEM module, the entropy attributes of each node in a visual tree are calculated. The BI module, which comprises block marking and block refining, uses these attributes to identify block types. The experimental results show that the node attributes and the visual tree are useful for extracting blocks from CSS Web pages. The system also outperforms other systems on container block extraction.
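The abstract's claim that a block with diverse content has relatively high entropy can be illustrated with a minimal sketch. The abstract does not define the thesis's actual entropy attributes, so this example assumes plain Shannon entropy over the whitespace-separated tokens of a block's text; the function name `block_entropy` and the sample strings are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter


def block_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the token distribution in a block's text.

    Assumption: tokens are lowercase whitespace-separated words. The thesis
    may compute its entropy attributes over different features.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # Sum of p * log2(1/p) over every distinct token.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical blocks: a repetitive one and a more varied one.
repetitive_block = "home home home home"
diverse_block = "entropy measures how diverse the content of this block is"

if __name__ == "__main__":
    print(block_entropy(repetitive_block))
    print(block_entropy(diverse_block))
```

Under this assumption the repetitive block scores zero and the varied block scores higher, which is the contrast the abstract relies on when it uses entropy to separate block types.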
author2 Hung-Yu Kao
author_facet Hung-Yu Kao
Wei-Ting Cho
卓威廷
author Wei-Ting Cho
卓威廷
spellingShingle Wei-Ting Cho
卓威廷
Visual Tree Evaluation on Block Extraction
author_sort Wei-Ting Cho
title Visual Tree Evaluation on Block Extraction
publishDate 2008
url http://ndltd.ncl.edu.tw/handle/04206107953791017472