Summary: | Master's === National Kaohsiung First University of Science and Technology === Graduate Institute of Information Management === 91 === Data warehousing and data mining techniques are gaining popularity as organizations realize the benefits of performing multi-dimensional analyses of accumulated historical business data to support current administrative decision-making. However, according to a survey by survey.com, only about 20% of an enterprise's business intelligence can be extracted from formatted data stored in relational databases. The remaining 80% is hidden in unstructured or semi-structured documents. For instance, market survey reports, project status reports, meeting records, customer complaint e-mails, patent applications, and competitors' advertisements are all recorded as documents. Therefore, the next challenge is the study of document warehousing and text mining to help enterprises obtain complete business intelligence. Since a document is multi-dimensional in nature, traditional indexing methods are not well suited to a document warehouse. Although a multi-dimensional array can be used to represent the index of a document warehouse, document cubes are usually sparse, so indexing a document cube with a multi-dimensional array results in poor space utilization. In this thesis, based on the concept of the R-tree, we propose an index structure called the D-tree to fit the requirements of a document cube, and we study the related properties of the D-tree to make the indexing process more efficient. We hope this infrastructure will allow us to extend our work by combining it with text processing technologies, making data warehousing and document warehousing among the most important kernels of knowledge management and customer relationship management applications.
|