Template-based Information Extraction from Tree-structured HTML Documents

Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 85 === This thesis proposes a novel approach to information extraction that identifies structural components in online web documents. The approach can be summarized as follows....

Bibliographic Details
Main Authors:	Yih, Wen-tau, 易文韜
Other Authors:	Jane Yung-jen Hsu
Format:	Others
Language:	zh-TW
Published:	1997
Online Access:	http://ndltd.ncl.edu.tw/handle/29386387918552988914

id	ndltd-TW-085NTU00392033
record_format	oai_dc
spelling	ndltd-TW-085NTU00392033 2016-07-01T04:15:37Z http://ndltd.ncl.edu.tw/handle/29386387918552988914 Template-based Information Extraction from Tree-structured HTML Documents 樹狀HTML文件之資料擷取 Yih, Wen-tau 易文韜 Master's, National Taiwan University, Graduate Institute of Computer Science and Information Engineering, 85 This thesis proposes a novel approach to information extraction that identifies structural components in online web documents. The approach can be summarized as follows. First, a document is modeled using three ingredients: format, content, and structure. Format is the visual presentation of a document. Content is the actual data that a document holds. Structure is the logical organization of a document. Structural components can be used as extraction targets: once a mapping relationship has been identified, the corresponding content is extracted. With the help of content and format, finding the mapping relationship is facilitated by a template, which is used to simulate the structure of a document. When the document model and the concept of template matching are applied to HTML documents, both documents and templates can be represented as trees. The matching problem between templates and documents can then be transformed into an approximate tree matching problem. After a mapping relationship has been built, the content that belongs to the target structural component is returned by the system as the result. Experiments have shown the feasibility of this template-based information extraction approach. In addition, the systematic procedure for building templates suggests that templates could be generated by means of machine learning. Jane Yung-jen Hsu 許永真 --- 1997 學位論文 ; thesis 105 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 85 === This thesis proposes a novel approach to information extraction that identifies structural components in online web documents. The approach can be summarized as follows. First, a document is modeled using three ingredients: format, content, and structure. Format is the visual presentation of a document. Content is the actual data that a document holds. Structure is the logical organization of a document. Structural components can be used as extraction targets: once a mapping relationship has been identified, the corresponding content is extracted. With the help of content and format, finding the mapping relationship is facilitated by a template, which is used to simulate the structure of a document. When the document model and the concept of template matching are applied to HTML documents, both documents and templates can be represented as trees. The matching problem between templates and documents can then be transformed into an approximate tree matching problem. After a mapping relationship has been built, the content that belongs to the target structural component is returned by the system as the result. Experiments have shown the feasibility of this template-based information extraction approach. In addition, the systematic procedure for building templates suggests that templates could be generated by means of machine learning.
author2	Jane Yung-jen Hsu
author_facet	Jane Yung-jen Hsu Yih, Wen-tau 易文韜
author	Yih, Wen-tau 易文韜
spellingShingle	Yih, Wen-tau 易文韜 Template-based Information Extraction from Tree-structured HTML Documents
author_sort	Yih, Wen-tau
title	Template-based Information Extraction from Tree-structured HTML Documents
title_short	Template-based Information Extraction from Tree-structured HTML Documents
title_full	Template-based Information Extraction from Tree-structured HTML Documents
title_fullStr	Template-based Information Extraction from Tree-structured HTML Documents
title_full_unstemmed	Template-based Information Extraction from Tree-structured HTML Documents
title_sort	template-based information extraction from tree-structured html documents
publishDate	1997
url	http://ndltd.ncl.edu.tw/handle/29386387918552988914
work_keys_str_mv	AT yihwentau templatebasedinformationextractionfromtreestructuredhtmldocuments AT yìwéntāo templatebasedinformationextractionfromtreestructuredhtmldocuments AT yihwentau shùzhuànghtmlwénjiànzhīzīliàoxiéqǔ AT yìwéntāo shùzhuànghtmlwénjiànzhīzīliàoxiéqǔ
_version_	1718328799522717696
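
Illustrative note: the abstract describes representing HTML documents and templates as trees and reducing template matching to approximate tree matching. This record does not say which matching algorithm the thesis uses, so the sketch below is only an assumption-laden illustration: it swaps in a simple top-down tree edit distance (whole subtrees are inserted or deleted, in the style of Selkow's algorithm) and picks the document subtree closest to a template. The names `Node`, `TreeBuilder`, `parse`, `distance`, `subtrees`, and `best_match` are hypothetical, and the parser assumes well-formed HTML without void elements such as an unclosed `<br>`.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    """One element of the HTML tree: tag = structure, text = content."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

    def size(self):
        # Number of nodes in this subtree; used as insert/delete cost.
        return 1 + sum(c.size() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def distance(a, b):
    """Top-down edit distance: relabel the roots, then align the child lists,
    where dropping or adding a child costs the size of its whole subtree."""
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + a.children[i - 1].size()
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + b.children[j - 1].size()
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + a.children[i - 1].size(),
                d[i][j - 1] + b.children[j - 1].size(),
                d[i - 1][j - 1] + distance(a.children[i - 1], b.children[j - 1]),
            )
    return (0 if a.tag == b.tag else 1) + d[m][n]


def subtrees(node):
    yield node
    for child in node.children:
        yield from subtrees(child)


def best_match(template, doc):
    """Return the document subtree with the smallest distance to the template."""
    return min(subtrees(doc), key=lambda n: distance(template, n))


if __name__ == "__main__":
    # Hypothetical template: a table row with two cells.
    template = parse("<tr><td></td><td></td></tr>").children[0]
    doc = parse(
        "<html><body><h1>Prices</h1>"
        "<table><tr><td>Tea</td><td>30</td></tr></table></body></html>"
    )
    hit = best_match(template, doc)
    print([cell.text for cell in hit.children])  # prints ['Tea', '30']
```

In this sketch the template carries only structure (tags); once the closest subtree is found, the content attached to its nodes is what gets returned, mirroring the structure-to-content mapping the abstract describes.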

Similar Items