Summary: | Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 105 === This research focuses on how to transform categorical data into numerical data efficiently. Although many encoders exist for processing categorical data, each suffers from one of two severe defects: it either preserves much information at the cost of high dimensionality, or yields low dimensionality at the cost of losing information. The goal of the proposed method is therefore to derive a new categorical transform, built on existing encoders, that produces low-dimensional yet effective features.
First, the proposed method takes OneHotEncoder as the base encoder and tries to improve it by combining it with k-means. Second, this research proposes a kernel trick, called Feature Combination, to extract and retain more information from the dataset. The idea of the trick is to combine the original categorical columns to create new columns. Feature Combination does improve the accuracy of the prediction model, but its bottleneck is its high-dimensional output. This research therefore proposes Pre-Selection, which selects important columns by Information Gain before Feature Combination is executed, to remove this bottleneck and let the proposed method achieve its original goal.
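As a rough illustration of how Pre-Selection and Feature Combination might fit together, the following minimal sketch ranks categorical columns by Information Gain, keeps the top ones, and then joins the kept columns pairwise into new categorical columns. It is not the thesis's implementation: the function names (`information_gain`, `pre_select`, `combine_features`), the pairwise combination scheme, the `keep` parameter, and the toy data are all assumptions made for illustration. The subsequent encoding step (OneHotEncoder combined with k-means) is not shown, since the summary does not describe how the two are combined.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Label entropy minus the entropy that remains after splitting on the column."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def pre_select(columns, labels, keep):
    """Return the names of the `keep` columns with the highest information gain."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:keep]


def combine_features(columns, names):
    """Build one new categorical column for every pair of the given columns.

    Pairwise joining is an assumption; the summary only says that original
    columns are combined to create new columns.
    """
    combined = {}
    for a, b in combinations(names, 2):
        combined[f"{a}+{b}"] = [f"{x}|{y}" for x, y in zip(columns[a], columns[b])]
    return combined


# Toy data, invented for this example: the label is "yes" only for red + round.
columns = {
    "color": ["red", "red", "red", "red", "blue", "blue", "blue", "blue"],
    "shape": ["round", "round", "square", "square", "round", "round", "square", "square"],
    "noise": ["a", "b", "a", "b", "a", "b", "a", "b"],
}
labels = ["yes", "yes", "no", "no", "no", "no", "no", "no"]

# Pre-Selection first, so that Feature Combination only sees the useful columns.
selected = pre_select(columns, labels, keep=2)
combined = combine_features(columns, selected)

for name, values in combined.items():
    print(name, round(information_gain(values, labels), 3))
```

On this toy data the uninformative `noise` column is dropped before any combination takes place, and the combined `color+shape` column separates the two classes completely, which neither original column does alone. Selecting before combining also keeps the number of generated columns small: with pairwise combination, `keep` selected columns yield `keep * (keep - 1) / 2` new columns rather than one for every pair in the full dataset.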
The proposed method is evaluated on categorical data from UCI and CTU. The final experimental results show that the features produced by the proposed method have between 1 and 4 dimensions, depending on the number of k-means clusters. Moreover, the accuracy on all datasets with the proposed method is almost 2 percent higher than with OneHotEncoder. Although the improvement in accuracy is not as high as expected, the number of feature dimensions is at least 20 times lower than that of OneHotEncoder.
|