Template-based Information Extraction from Tree-structured HTML Documents

Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 85 === This thesis proposes a novel approach to information extraction that identifies structural components in online web documents. The approach can be summarized as follows...


Bibliographic Details
Main Authors: Yih, Wen-tau, 易文韜
Other Authors: Jane Yung-jen Hsu
Format: Others
Language: zh-TW
Published: 1997
Online Access: http://ndltd.ncl.edu.tw/handle/29386387918552988914
id ndltd-TW-085NTU00392033
record_format oai_dc
spelling ndltd-TW-085NTU00392033 2016-07-01T04:15:37Z http://ndltd.ncl.edu.tw/handle/29386387918552988914 Template-based Information Extraction from Tree-structured HTML Documents 樹狀HTML文件之資料擷取 Yih, Wen-tau 易文韜 Master's National Taiwan University Graduate Institute of Computer Science and Information Engineering 85 This thesis proposes a novel approach to information extraction that identifies structural components in online web documents. The approach can be summarized as follows. First, a document is modeled with three ingredients: format, content, and structure. Format is the visual presentation of a document. Content is the actual data the document holds. Structure is the logical organization of the document. Structural components serve as extraction targets: once a mapping relationship has been identified, the corresponding content is extracted. With the help of content and format, finding the mapping relationship is facilitated by a template, which is used to simulate the structure of a document. When the document model and the concept of template matching are applied to HTML documents, both documents and templates can be represented as trees, and the problem of matching templates against documents becomes an approximate tree matching problem. Once a mapping relationship has been built, the content belonging to the target structural component is returned by the system. Experiments have shown the feasibility of this template-based information extraction approach. In addition, the systematic procedure for building templates suggests that templates could be generated by machine learning. Jane Yung-jen Hsu 許永真 --- 1997 學位論文 ; thesis 105 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 85 === This thesis proposes a novel approach to information extraction that identifies structural components in online web documents. The approach can be summarized as follows. First, a document is modeled with three ingredients: format, content, and structure. Format is the visual presentation of a document. Content is the actual data the document holds. Structure is the logical organization of the document. Structural components serve as extraction targets: once a mapping relationship has been identified, the corresponding content is extracted. With the help of content and format, finding the mapping relationship is facilitated by a template, which is used to simulate the structure of a document. When the document model and the concept of template matching are applied to HTML documents, both documents and templates can be represented as trees, and the problem of matching templates against documents becomes an approximate tree matching problem. Once a mapping relationship has been built, the content belonging to the target structural component is returned by the system. Experiments have shown the feasibility of this template-based information extraction approach. In addition, the systematic procedure for building templates suggests that templates could be generated by machine learning.
author2 Jane Yung-jen Hsu
author_facet Jane Yung-jen Hsu
Yih, Wen-tau
易文韜
author Yih, Wen-tau
易文韜
spellingShingle Yih, Wen-tau
易文韜
Template-based Information Extraction from Tree-structured HTML Documents
author_sort Yih, Wen-tau
title Template-based Information Extraction from Tree-structured HTML Documents
publishDate 1997
url http://ndltd.ncl.edu.tw/handle/29386387918552988914
_version_ 1718328799522717696
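
Illustrative sketch (not taken from the thesis): the abstract describes representing both HTML documents and templates as trees and extracting the content mapped to a target structural component through approximate tree matching. The Python below shows one simplified reading of that idea. It assumes well-formed HTML without unclosed void elements, treats "approximate" as "the document may contain extra nodes that the template does not mention", and marks the extraction target with a made-up data-target attribute. The names Node, TreeBuilder, parse, match, and extract are hypothetical and do not come from the record or the thesis.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class Node:
    tag: str                      # element name, or "#text" for a text node
    text: str = ""                # content, set on text nodes only
    children: List["Node"] = field(default_factory=list)
    target: bool = False          # template only: marks the extraction target


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every start tag has a matching end tag."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # "data-target" is an assumed marker convention for templates.
        node = Node(tag, target="data-target" in dict(attrs))
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", data.strip()))


def parse(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_of(node: Node) -> str:
    """Concatenates the text content found under a node."""
    if node.tag == "#text":
        return node.text
    return " ".join(t for t in map(text_of, node.children) if t)


def match(template: Node, doc: Node) -> Optional[List[str]]:
    """Returns extracted strings if the template maps onto doc, else None.

    Template children must appear among doc's children in the same order;
    doc children that the template does not mention are skipped.
    """
    if template.tag != doc.tag:
        return None
    found = [text_of(doc)] if template.target else []
    i = 0
    for t_child in template.children:
        while i < len(doc.children):
            sub = match(t_child, doc.children[i])
            i += 1
            if sub is not None:
                found.extend(sub)
                break
        else:
            return None  # no remaining doc child fits this template child
    return found


def extract(template: Node, doc: Node) -> Optional[List[str]]:
    """Searches doc top-down for the first subtree the template maps onto."""
    result = match(template, doc)
    if result is not None:
        return result
    for child in doc.children:
        result = extract(template, child)
        if result is not None:
            return result
    return None


doc = parse(
    "<html><body><h1>Prices</h1><table>"
    "<tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Tea</td><td>30</td></tr>"
    "</table></body></html>"
)
template = parse("<table><tr><td></td><td data-target></td></tr></table>").children[0]
print(extract(template, doc))  # → ['30']
```

The header row is skipped because its th cells do not fit the template's td cells, so the mapping settles on the second row and returns the content of the marked cell. A fuller treatment would score partial matches (for example with a tree edit distance) rather than only skipping unmatched nodes.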