Summary: | Master's === National Taiwan University === Graduate Institute of Information Management === 93 === Corporate knowledge bases have to process thousands of text documents every day. These include external information such as competitors' information, industry analysis reports, and customer requirements, as well as internal information such as financial statements, technical reports, and patterns, all of which are considered crucial for business operations. However, collecting, filtering, and filing these documents is time-consuming and labor-intensive, so automatic text classification is needed to solve the problem. How to employ automatic techniques to improve on manual classification performance, and to handle large volumes of classification tasks, has become an issue in the fields of information services and knowledge management.
The appropriateness of the company's knowledge-base hierarchy, the representativeness of the texts in each class, and the consistency of data collection all affect text classification performance. In addition, the method for selecting key terms, the level of understanding of unknown texts, and the balance between speed and accuracy should be taken into consideration when constructing an automatic text classification system.
In this research, an automatic text classification system is implemented, with texts gathered from the Sinica Corpus. Several machine learning and non-machine learning methods are compared in the thesis, and the effect of varying the level of understanding of the texts is also measured. Furthermore, a method for measuring corpus similarity and homogeneity is applied to the classes in order to assess the appropriateness of the predefined classes or of the texts in those classes.
|