A low dimensional categorical data transform based on Feature Combination

Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 105 === This research focuses on how to transform categorical data into numerical data efficiently. Although there are plenty of encoders for processing categorical data, each of them suffers from one of two severe defects: high information with high dimension, or low dimension with...


Bibliographic Details
Main Authors:	Wie-Zhih Lin, 林威志
Other Authors:	Wei-Chung Teng
Format:	Others
Language:	zh-TW
Published:	2017
Online Access:	http://ndltd.ncl.edu.tw/handle/7k2q7u

id	ndltd-TW-105NTUS5392063
record_format	oai_dc
spelling	ndltd-TW-105NTUS5392063 2019-05-15T23:46:35Z http://ndltd.ncl.edu.tw/handle/7k2q7u A low dimensional categorical data transform based on Feature Combination 一個基於特徵組合之類別資料低維度轉換方法 Wie-Zhih Lin 林威志 Master's, National Taiwan University of Science and Technology, Department of Computer Science and Information Engineering, 105 Wei-Chung Teng 鄧惟中 2017 thesis 53 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 105 === This research focuses on how to transform categorical data into numerical data efficiently. Although there are plenty of encoders for processing categorical data, each of them suffers from one of two severe defects: it either preserves much information at the cost of high dimension, or achieves low dimension while preserving little information. Hence, the goal of the proposed method is to derive a new categorical transform, with existing encoders as the base, that yields low-dimensional and effective features. First, the proposed method takes OneHotEncoder as the base encoder and tries to improve it by combining it with k-means. Second, this research proposes a kernel trick, called Feature Combination, to extract and preserve more information from the dataset. The idea of the trick is to combine the original categorical columns to create new columns. Feature Combination does improve the accuracy of the prediction model, but its bottleneck is its high-dimensional output. Therefore, this research proposes Pre-Selection, which selects important columns by Information Gain before executing Feature Combination, to remove the bottleneck and let the proposed method achieve its original goal. The proposed method is evaluated on categorical data from UCI and CTU. The final experimental results show that the features produced by the proposed method have between 1 and 4 dimensions, depending on the number of k-means clusters. Moreover, the accuracy on all datasets with the proposed method is almost 2 percent higher than with OneHotEncoder. Although the improvement in accuracy is not as high as expected, the number of feature dimensions is at least 20 times lower than that of OneHotEncoder.
author2	Wei-Chung Teng
author_facet	Wei-Chung Teng Wie-Zhih Lin 林威志
author	Wie-Zhih Lin 林威志
spellingShingle	Wie-Zhih Lin 林威志 A low dimensional categorical data transform based on Feature Combination
author_sort	Wie-Zhih Lin
title	A low dimensional categorical data transform based on Feature Combination
title_sort	low dimensional categorical data transform based on feature combination
publishDate	2017
url	http://ndltd.ncl.edu.tw/handle/7k2q7u
work_keys_str_mv	AT wiezhihlin alowdimensionalcategoricaldatatransformbasedonfeaturecombination AT línwēizhì alowdimensionalcategoricaldatatransformbasedonfeaturecombination AT wiezhihlin yīgèjīyútèzhēngzǔhézhīlèibiézīliàodīwéidùzhuǎnhuànfāngfǎ AT línwēizhì yīgèjīyútèzhēngzǔhézhīlèibiézīliàodīwéidùzhuǎnhuànfāngfǎ AT wiezhihlin lowdimensionalcategoricaldatatransformbasedonfeaturecombination AT línwēizhì lowdimensionalcategoricaldatatransformbasedonfeaturecombination
_version_	1719153208107991040
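
The Pre-Selection and Feature Combination steps named in the description can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the pairwise combination rule, the "|" separator, the top_n cutoff, and the toy data are all assumptions made for illustration. The OneHotEncoder and k-means stage is left out because the abstract does not say how the two are combined.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in label entropy obtained by splitting on one categorical column."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def pre_select(rows, labels, columns, top_n):
    """Keep the top_n columns ranked by information gain (assumed cutoff rule)."""
    gains = {c: information_gain([r[c] for r in rows], labels) for c in columns}
    return sorted(columns, key=lambda c: gains[c], reverse=True)[:top_n]


def combine_features(rows, columns):
    """Add one new categorical column per pair of the given columns (assumed rule)."""
    combined = []
    for row in rows:
        new_row = dict(row)
        for a, b in combinations(columns, 2):
            new_row[f"{a}+{b}"] = f"{row[a]}|{row[b]}"
        combined.append(new_row)
    return combined


# Toy data, made up for illustration only.
rows = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "large"},
    {"color": "red", "shape": "square", "size": "large"},
    {"color": "blue", "shape": "square", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
    {"color": "red", "shape": "round", "size": "large"},
]
labels = ["yes", "yes", "yes", "no", "no", "no"]

# Rank columns first, then combine only the selected ones.
selected = pre_select(rows, labels, ["color", "shape", "size"], top_n=2)
augmented = combine_features(rows, selected)
```

Selecting columns before combining them keeps the number of generated columns small: with n selected columns the pairwise rule above adds n(n-1)/2 new columns, so reducing n directly limits the growth in dimension that the description identifies as the bottleneck.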


Similar Items