Template-based Information Extraction from Tree-structured HTML Documents
碩士 === 國立臺灣大學 === 資訊工程學系研究所 === 85 === This thesis proposes a novel approach of information extraction by identifying structural components in on-line web documents. The brief description of this approach can be introduced as follows....
Main Authors: | Yih, Wen-tau, 易文韜 |
---|---|
Other Authors: | Jane Yung-jen Hsu |
Format: | Others |
Language: | zh-TW |
Published: | 1997 |
Online Access: | http://ndltd.ncl.edu.tw/handle/29386387918552988914 |
id |
ndltd-TW-085NTU00392033 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-085NTU003920332016-07-01T04:15:37Z http://ndltd.ncl.edu.tw/handle/29386387918552988914 Template-based Information Extraction from Tree-structured HTML Documents 樹狀HTML文件之資料擷取 Yih, Wen-tau 易文韜 碩士 國立臺灣大學 資訊工程學系研究所 85 This thesis proposes a novel approach of information extraction by identifying structural components in on-line web documents. The brief description of this approach can be introduced as follows. First, documents can be modeled using the three ingredients: format, content, and structure. Format is the visualized view of a document. Content is the actual data that a document has. Structure is the logical organization of a document. Structural components can be used as extraction targets. If a mapping relationship has been identified, the corresponding content will then be extracted. With the help of content and format, the procedure of finding the mapping relationship is facilitated by a template, which is used to simulate the structure of a document. When applying the document model and the concept of template matching to HTML documents, both documents and templates can be represented in tree-structures. The matching problem between templates and documents can then be transformed into an approximate tree matching problem. After a mapping relationship has been built, the content that belongs to the target structural component will be the result returned by the system. Experiments have shown the feasibility of this template-based information extraction approach. In addition, the systematic procedure of building templates also implies the possibility of generation of templates by the means of machine learning. Jane Yung-jen Hsu 許永真 --- 1997 學位論文 ; thesis 105 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
碩士 === 國立臺灣大學 === 資訊工程學系研究所 === 85 === This thesis proposes a novel approach of information extraction by
identifying structural components in on-line web documents. The
brief description of this approach can be introduced as follows.
First, documents can be modeled using the three ingredients:
format, content, and structure. Format is the
visualized view of a document. Content is the actual data
that a document has. Structure is the logical organization of
a document. Structural components can be used as extraction
targets. If a mapping relationship has been identified, the
corresponding content will then be extracted. With the help
of content and format, the procedure of finding the mapping
relationship is facilitated by a template, which is used
to simulate the structure of a document.
When applying the document model and the concept
of template matching to HTML documents, both documents and
templates can be represented in tree-structures. The matching
problem between templates and documents can then be transformed
into an approximate tree matching problem. After a mapping
relationship has been built, the content that belongs to the
target structural component will be the result returned by the
system.
Experiments have shown the feasibility
of this template-based information extraction approach.
In addition, the systematic procedure of building templates
also implies the possibility of generation of templates by
the means of machine learning.
|
author2 |
Jane Yung-jen Hsu |
author_facet |
Jane Yung-jen Hsu Yih, Wen-tau 易文韜 |
author |
Yih, Wen-tau 易文韜 |
spellingShingle |
Yih, Wen-tau 易文韜 Template-based Information Extraction from Tree-structured HTML Documents |
author_sort |
Yih, Wen-tau |
title |
Template-based Information Extraction from Tree-structured HTML Documents |
title_short |
Template-based Information Extraction from Tree-structured HTML Documents |
title_full |
Template-based Information Extraction from Tree-structured HTML Documents |
title_fullStr |
Template-based Information Extraction from Tree-structured HTML Documents |
title_full_unstemmed |
Template-based Information Extraction from Tree-structured HTML Documents |
title_sort |
template-based information extraction from tree-structured html documents |
publishDate |
1997 |
url |
http://ndltd.ncl.edu.tw/handle/29386387918552988914 |
work_keys_str_mv |
AT yihwentau templatebasedinformationextractionfromtreestructuredhtmldocuments AT yìwéntāo templatebasedinformationextractionfromtreestructuredhtmldocuments AT yihwentau shùzhuànghtmlwénjiànzhīzīliàoxiéqǔ AT yìwéntāo shùzhuànghtmlwénjiànzhīzīliàoxiéqǔ |
_version_ |
1718328799522717696 |
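
The abstract in this record describes representing both the template and the HTML document as trees, finding an approximate mapping between them, and returning the content mapped to a target structural component. The record does not give the thesis's actual algorithm, so the following is only a minimal sketch under stated assumptions: it uses a simple top-down tree edit distance in which whole subtrees are inserted or deleted, and the names `Node`, `target`, `match`, and `extract` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    text: str = ""
    target: bool = False  # hypothetical marker: template node whose match is extracted


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def match(t, d):
    """Align template subtree t with document subtree d.

    Returns (cost, pairs): cost counts relabeled nodes plus the sizes of
    subtrees skipped on either side; pairs lists (template, document) nodes.
    """
    root_cost = 0 if t.tag == d.tag else 1
    a, b = t.children, d.children
    # dp[i][j] = best (cost, pairs) aligning the first i template children
    # with the first j document children.
    dp = [[None] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = (0, [])
    for i in range(1, len(a) + 1):
        dp[i][0] = (dp[i - 1][0][0] + size(a[i - 1]), [])
    for j in range(1, len(b) + 1):
        dp[0][j] = (dp[0][j - 1][0] + size(b[j - 1]), [])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub_cost, sub_pairs = match(a[i - 1], b[j - 1])
            options = [
                # pair the two children with each other
                (dp[i - 1][j - 1][0] + sub_cost, dp[i - 1][j - 1][1] + sub_pairs),
                # template child has no counterpart in the document
                (dp[i - 1][j][0] + size(a[i - 1]), dp[i - 1][j][1]),
                # document child is extra material not covered by the template
                (dp[i][j - 1][0] + size(b[j - 1]), dp[i][j - 1][1]),
            ]
            dp[i][j] = min(options, key=lambda o: o[0])
    child_cost, child_pairs = dp[len(a)][len(b)]
    return root_cost + child_cost, [(t, d)] + child_pairs


def extract(template, document):
    """Return the match cost and the text of document nodes mapped to targets."""
    cost, pairs = match(template, document)
    return cost, [d.text for t, d in pairs if t.target]


# Template: html > body > (h1, table > tr > (td, td[target]))
template = Node("html", [Node("body", [
    Node("h1"),
    Node("table", [Node("tr", [Node("td"), Node("td", target=True)])]),
])])

# Document: same shape plus one extra <p> that the template does not mention.
page = Node("html", [Node("body", [
    Node("h1", text="Catalog"),
    Node("p", text="Updated daily"),
    Node("table", [Node("tr", [Node("td", text="Price"), Node("td", text="42")])]),
])])

cost, values = extract(template, page)
```

In this toy case the extra `<p>` element is skipped at a cost of 1, and the text of the second table cell is returned as the extracted value. A real system would parse HTML into the tree first and would likely weight tags and formatting cues differently; those choices are outside what this record states.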