Kernel-Based Data Mining Approach with Variable Selection for Nonlinear High-Dimensional Data

In statistical data mining research, datasets often have nonlinearity and high-dimensionality. It has become difficult to analyze such datasets in a comprehensive manner using traditional statistical methodologies. Kernel-based data mining is one of the most effective statistical methodologies to in...

Bibliographic Details
Main Author: Baek, Seung Hyun
Format: Others
Published: Trace: Tennessee Research and Creative Exchange 2010
Online Access: http://trace.tennessee.edu/utk_graddiss/676
id ndltd-UTENN-oai-trace.tennessee.edu-utk_graddiss-1758
record_format oai_dc
spelling ndltd-UTENN-oai-trace.tennessee.edu-utk_graddiss-1758 2011-12-13T16:02:54Z 2010-05-01 text application/pdf Doctoral Dissertations
collection NDLTD
format Others
sources NDLTD
topic Classification
Support Vector Machine
Information Complexity
Wavelet Thresholding
Recursive Feature Elimination
Floating Search
Industrial Engineering
description In statistical data mining research, datasets are often nonlinear and high-dimensional. It has become difficult to analyze such datasets in a comprehensive manner using traditional statistical methodologies. Kernel-based data mining is one of the most effective statistical methodologies to investigate a variety of problems in areas including pattern recognition, machine learning, bioinformatics, chemometrics, and statistics. In particular, statistically sophisticated procedures that emphasize the reliability of results and computational efficiency are required for the analysis of high-dimensional data. In this dissertation, first, a novel wrapper method called SVM-ICOMP-RFE, based on a hybridized support vector machine (SVM) and recursive feature elimination (RFE) with the information-theoretic measure of complexity (ICOMP), is introduced and developed to classify high-dimensional data sets and to carry out subset selection of the variables in the original data space, in order to find the subset that best discriminates between groups. RFE ranks variables based on the ICOMP criterion. Second, a dual variables functional support vector machine approach is proposed. The proposed approach uses both the first and second derivatives of the degradation profiles. A modified floating search algorithm for repeated variable selection, with newly added degradation path points, is presented to find a few good variables while reducing the computation time for on-line implementation. Third, a two-stage scheme for the classification of near-infrared (NIR) spectral data is proposed. In the first stage, the proposed multi-scale vertical energy thresholding (MSVET) procedure is used to reduce the dimension of the high-dimensional spectral data. In the second stage, a few important wavelet coefficients are selected using the proposed SVM gradient-recursive feature elimination (RFE). Fourth, a novel methodology for discriminant analysis, called PDCM and based on a human decision-making process, is proposed. The proposed methodology consists of three basic steps emulating the thinking process: perception, decision, and cognition. In these steps, two concepts, support vector machines for classification and information complexity, are integrated to evaluate learning models.
author Baek, Seung Hyun
title Kernel-Based Data Mining Approach with Variable Selection for Nonlinear High-Dimensional Data
publisher Trace: Tennessee Research and Creative Exchange
publishDate 2010
url http://trace.tennessee.edu/utk_graddiss/676