Visual Tree Evaluation on Block Extraction
Master's thesis === National Cheng Kung University === Department of Computer Science and Information Engineering (Master's and Doctoral Program) === 96 === More and more people use Cascading Style Sheets (CSS) to manage their Web pages because CSS makes typesetting easy and convenient. However, CSS gives a Web page an ambiguous structure. The data extraction systems based on Web page structure matc...
Main Authors: | Wei-Ting Cho, 卓威廷 |
---|---|
Other Authors: | Hung-Yu Kao |
Format: | Others |
Language: | en_US |
Published: | 2008 |
Online Access: | http://ndltd.ncl.edu.tw/handle/04206107953791017472 |
id |
ndltd-TW-096NCKU5392088 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-096NCKU5392088 2015-11-23T04:03:11Z http://ndltd.ncl.edu.tw/handle/04206107953791017472 Visual Tree Evaluation on Block Extraction 基於視覺樹評估之區塊擷取 Wei-Ting Cho 卓威廷 Master's thesis, National Cheng Kung University, Department of Computer Science and Information Engineering (Master's and Doctoral Program), 96 Hung-Yu Kao 高宏宇 2008 學位論文 ; thesis 52 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's thesis === National Cheng Kung University === Department of Computer Science and Information Engineering (Master's and Doctoral Program) === 96 === More and more people use Cascading Style Sheets (CSS) to manage their Web pages because CSS makes typesetting easy and convenient. However, CSS gives a Web page an ambiguous structure, so data extraction systems based on Web page structure matching may make more misjudgments. Furthermore, they identify only blocks with similar structures. Some systems use specific HTML tags such as TABLE, TR, TD, and P to partition Web pages, but in CSS Web pages these tags are generally less common than DIV tags. To address these limitations, we present a system that uses properties of CSS Web pages to extract data blocks. The system comprises three modules: Visual Tree Generation (VTG), Entropy Evaluation Model (EEM), and Block Identification (BI). In the VTG module, Web pages are first converted into tree objects: the module transforms DOM trees into visual trees by using the visual information and HTML tag names of nodes to modify the tree structure. The resulting visual tree represents the arrangement of data as displayed in a Web browser, which matches the visual intention needed for evaluating informative blocks. If a block consists of diverse content, its entropy is relatively high; entropy is therefore a suitable measure of the information content of blocks for distinguishing presentation blocks from others. In the EEM module, the entropy attributes of each node in a visual tree are calculated. The BI module, which comprises block marking and block refining, uses these attributes to identify block types. The experimental results show that the node attributes and the visual tree are useful for extracting blocks from CSS Web pages. Our system also outperforms other systems on container block extraction.
|
author2 |
Hung-Yu Kao |
author_facet |
Hung-Yu Kao Wei-Ting Cho 卓威廷 |
author |
Wei-Ting Cho 卓威廷 |
spellingShingle |
Wei-Ting Cho 卓威廷 Visual Tree Evaluation on Block Extraction |
author_sort |
Wei-Ting Cho |
title |
Visual Tree Evaluation on Block Extraction |
title_short |
Visual Tree Evaluation on Block Extraction |
title_full |
Visual Tree Evaluation on Block Extraction |
title_fullStr |
Visual Tree Evaluation on Block Extraction |
title_full_unstemmed |
Visual Tree Evaluation on Block Extraction |
title_sort |
visual tree evaluation on block extraction |
publishDate |
2008 |
url |
http://ndltd.ncl.edu.tw/handle/04206107953791017472 |
work_keys_str_mv |
AT weitingcho visualtreeevaluationonblockextraction AT zhuōwēitíng visualtreeevaluationonblockextraction AT weitingcho jīyúshìjuéshùpínggūzhīqūkuàixiéqǔ AT zhuōwēitíng jīyúshìjuéshùpínggūzhīqūkuàixiéqǔ |
_version_ |
1718133997577437184 |
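
The description field states that a block with diverse content has relatively high entropy. The record does not give the thesis's entropy formula, so the following is only a minimal Python sketch under stated assumptions: it uses plain Shannon entropy over word tokens, and the `Node` class and the `tokens` and `block_entropy` functions are hypothetical names introduced here for illustration, not the author's implementation.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical visual-tree node: a tag name, its own text, and children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def tokens(node: Node) -> List[str]:
    """Collect lowercase word tokens from a node and all of its descendants."""
    found = node.text.lower().split()
    for child in node.children:
        found.extend(tokens(child))
    return found


def block_entropy(node: Node) -> float:
    """Shannon entropy, in bits, of the token distribution under a node.

    Assumption: word-token frequencies stand in for "content diversity";
    the thesis may define its entropy attributes differently.
    """
    counts = Counter(tokens(node))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over tokens of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

Under these assumptions, a block that repeats one token (for example, three links that all read "home") scores zero, while a block with four distinct, equally frequent tokens scores two bits, which matches the abstract's intuition that more diverse content yields higher entropy.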