A low dimensional categorical data transform based on Feature Combination

Master's thesis === National Taiwan University of Science and Technology (國立臺灣科技大學) === Department of Computer Science and Information Engineering (資訊工程系) === 105 === This research focuses on how to transform categorical data into numerical data efficiently. Although there are plenty of encoders for processing categorical data, each suffers from one of two severe defects: high information with high dimension, or low dimension with...

Bibliographic Details
Main Authors: Wie-Zhih Lin, 林威志
Other Authors: Wei-Chung Teng
Format: Others
Language: zh-TW
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/7k2q7u
id ndltd-TW-105NTUS5392063
record_format oai_dc
spelling ndltd-TW-105NTUS5392063 2019-05-15T23:46:35Z http://ndltd.ncl.edu.tw/handle/7k2q7u A low dimensional categorical data transform based on Feature Combination 一個基於特徵組合之類別資料低維度轉換方法 Wie-Zhih Lin 林威志 碩士 國立臺灣科技大學 資訊工程系 105 Wei-Chung Teng 鄧惟中 2017 學位論文 ; thesis 53 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis === National Taiwan University of Science and Technology (國立臺灣科技大學) === Department of Computer Science and Information Engineering (資訊工程系) === 105 === This research focuses on how to transform categorical data into numerical data efficiently. Although there are plenty of encoders for processing categorical data, each suffers from one of two severe defects: it either retains a lot of information at the cost of high dimensionality, or achieves low dimensionality at the cost of retaining little information. Hence, the goal of the proposed method is to derive a new categorical transform, built on existing encoders, that yields low-dimensional and effective features. First, the proposed method takes OneHotEncoder as the base encoder and tries to improve it by combining it with k-means. Second, this research proposes a kernel trick, called Feature Combination, to extract and preserve more information from the dataset. The idea of the trick is to combine the original categorical columns to create new columns. Feature Combination does improve the accuracy of the prediction model, but its bottleneck is its high-dimensional output. Therefore, this research proposes Pre-Selection, which selects important columns by Information Gain before Feature Combination is executed, to remove the bottleneck so that the proposed method achieves its original goal. The proposed method is evaluated on categorical data from UCI and CTU. The final experimental results show that the features produced by the proposed method have between 1 and 4 dimensions, depending on the number of k-means clusters. Moreover, the accuracy on all datasets with the proposed method is almost 2 percent higher than with OneHotEncoder. Although the improvement in accuracy is not as high as expected, the number of feature dimensions is at least 20 times lower than that of OneHotEncoder.
author2 Wei-Chung Teng
author_facet Wei-Chung Teng
Wie-Zhih Lin
林威志
author Wie-Zhih Lin
林威志
spellingShingle Wie-Zhih Lin
林威志
A low dimensional categorical data transform based on Feature Combination
author_sort Wie-Zhih Lin
title A low dimensional categorical data transform based on Feature Combination
title_sort low dimensional categorical data transform based on feature combination
publishDate 2017
url http://ndltd.ncl.edu.tw/handle/7k2q7u
work_keys_str_mv AT wiezhihlin alowdimensionalcategoricaldatatransformbasedonfeaturecombination
AT línwēizhì alowdimensionalcategoricaldatatransformbasedonfeaturecombination
AT wiezhihlin yīgèjīyútèzhēngzǔhézhīlèibiézīliàodīwéidùzhuǎnhuànfāngfǎ
AT línwēizhì yīgèjīyútèzhēngzǔhézhīlèibiézīliàodīwéidùzhuǎnhuànfāngfǎ
AT wiezhihlin lowdimensionalcategoricaldatatransformbasedonfeaturecombination
AT línwēizhì lowdimensionalcategoricaldatatransformbasedonfeaturecombination
_version_ 1719153208107991040
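
The description above names two steps, Pre-Selection by Information Gain and Feature Combination, without giving their details. The following is a minimal Python sketch of one plausible reading, not the thesis's actual implementation. It assumes that columns are combined pairwise by joining their values into a single string, and that Pre-Selection keeps the top-ranked columns by entropy-based information gain. All function names (`entropy`, `information_gain`, `pre_select`, `combine_features`) and the toy data are hypothetical and invented for illustration. The OneHotEncoder and k-means stage is not covered here.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in label entropy after grouping the rows by the column's values."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def pre_select(columns, labels, top_n):
    """Return the names of the top_n columns ranked by information gain."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:top_n]


def combine_features(columns, names):
    """Create one new categorical column for each pair of the given columns."""
    combined = {}
    for a, b in combinations(names, 2):
        # Assumption: a combined value is the two original values joined by "|".
        combined[f"{a}+{b}"] = [f"{x}|{y}" for x, y in zip(columns[a], columns[b])]
    return combined


# Toy data, invented for illustration: the label is 1 only when a == b == "1".
columns = {
    "a": ["0", "0", "1", "1", "0", "0", "1", "1"],
    "b": ["0", "1", "0", "1", "0", "1", "0", "1"],
    "noise": ["x", "x", "y", "y", "y", "y", "x", "x"],
}
labels = [0, 0, 0, 1, 0, 0, 0, 1]

selected = pre_select(columns, labels, top_n=2)
combined = combine_features(columns, selected)

for name in selected:
    print(name, round(information_gain(columns[name], labels), 4))
for name, values in combined.items():
    print(name, round(information_gain(values, labels), 4))
```

In this toy example the `noise` column has zero information gain and is dropped before combination, so only one combined column is created instead of three. The combined column determines the label completely, which neither `a` nor `b` does alone. Without Pre-Selection the number of pairwise combinations grows quadratically with the number of columns, which is consistent with the high-dimensional bottleneck the description mentions.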