Summary: | Master's === National Taipei University === Department of Statistics === 106 === Term frequency is an important quantity in the analysis of text data. It measures how often specified terms occur in the text documents. Creating a structured term frequency matrix from a large number of unstructured text documents is therefore the first step in text mining. The term frequency matrix is usually a high-dimensional sparse matrix. If the matrix contains a large number of rare and/or redundant terms, the results of the analysis can be misleading; in addition, the complexity of the analysis increases and the accuracy of the results decreases. Thus, applying dimension reduction techniques to the high-dimensional sparse term frequency matrix before the text analysis, to select the most frequent terms, could improve the accuracy of subsequent analysis. In this study, we compare three dimension reduction methods for high-dimensional sparse data under different levels of sparsity and different ratios of data size to dimension. The methods are traditional Principal Component Analysis (PCA), Regularized Principal Component Analysis (rPCA), and Sparse Principal Component Analysis (SPCA). We conduct simulation studies and analyze real text data to investigate the applicability of these dimension reduction methods to high-dimensional sparse data.
|
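To make the notions of a term frequency matrix and its sparsity concrete, the following is a minimal sketch, not the procedure used in the thesis. It assumes lowercase whitespace tokenization and three toy documents, and the helper names `term_frequency_matrix` and `sparsity` are hypothetical. Rows are documents, columns are terms, and sparsity is taken here to mean the fraction of zero entries.

```python
from collections import Counter


def term_frequency_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i.

    Assumption: terms are lowercase whitespace-separated tokens.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Toy documents, invented purely for illustration.
docs = [
    "sparse data needs dimension reduction",
    "term frequency counts terms",
    "sparse term frequency matrix",
]
vocab, tf = term_frequency_matrix(docs)
print(len(docs), "documents x", len(vocab), "terms")
print("sparsity:", round(sparsity(tf), 3))
```

Even in this toy case more than half of the entries are zero. With a realistic vocabulary the proportion is far higher, which is the setting in which PCA, rPCA, and SPCA are compared.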