Some Advances in Classifying and Modeling Complex Data

In statistical data analysis, two of the most commonly used techniques are classification and regression modeling. As scientific technology progresses rapidly, complex data arise often and require novel classification and regression modeling methodologies tailored to the data stru...

Bibliographic Details
Main Author: Zhang, Angang
Other Authors: Statistics
Format: Others
Published: Virginia Tech 2017
Subjects:
Online Access: http://hdl.handle.net/10919/77958
id ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-77958
record_format oai_dc
spelling ndltd-VTETD-oai-vtechworks.lib.vt.edu-10919-77958 2020-09-29T05:31:03Z Some Advances in Classifying and Modeling Complex Data Zhang, Angang Statistics Deng, Xinwei Kim, Inyoung Smith, Eric P. Hong, Yili A/B testing fraud detection linear classifier misclassification error net profit value reject inference sparse linear discriminant analysis two-class classification variance component mixed model. Ph. D. 2017-06-09T06:00:15Z 2017-06-09T06:00:15Z 2015-12-16 Dissertation vt_gsexam:6944 http://hdl.handle.net/10919/77958 In Copyright http://rightsstatements.org/vocab/InC/1.0/ ETD application/pdf Virginia Tech
collection NDLTD
format Others
sources NDLTD
topic A/B testing
fraud detection
linear classifier
misclassification error
net profit value
reject inference
sparse linear discriminant analysis
two-class classification
variance component mixed model.
description In statistical data analysis, two of the most commonly used techniques are classification and regression modeling. As scientific technology progresses rapidly, complex data arise often and require novel classification and regression modeling methodologies tailored to the data structure. In this dissertation, I focus on developing several approaches for analyzing data with complex structures. Classification problems commonly occur in many areas such as biomedicine, marketing, sociology, and image recognition. Among the various classification methods, linear classifiers have been widely used because of their computational advantages and their ease of implementation and interpretation compared with non-linear classifiers. Specifically, linear discriminant analysis (LDA) is one of the most important methods in the family of linear classifiers. High-dimensional data, in which the number of variables p exceeds the number of observations n, occur increasingly often and call for advanced classification techniques. In Chapter 2, I propose a novel sparse LDA method that generalizes LDA through a regularized approach for the two-class classification problem. The proposed method can achieve accurate classification with attractive computation, making it suitable for high-dimensional data with p > n. In Chapter 3, I deal with classification when the data complexity lies in non-randomly missing responses in the training data set, for which an appropriate classification method needs to be developed. Specifically, I consider the "reject inference problem" in the application of fraud detection for online business. To prevent fraudulent transactions, online businesses reject suspicious transactions, whose fraud status therefore remains unknown, yielding training data with selectively missing responses. A two-stage modeling approach using logistic regression is proposed to enhance the efficiency and accuracy of fraud detection.
Beyond classification, data from designed experiments in scientific areas often have complex structures. Many experiments are conducted with multiple sources of variance. To increase the accuracy of the statistical modeling, the model needs to accommodate more than one error term. In Chapter 4, I propose a variance component mixed model for data from a nanomaterial experiment that incorporates the between-group, within-group, and within-subject variance components into a single model. To adjust for possible systematic error introduced during the experiment, adjustment terms can be added. Specifically, a group adaptive forward and backward selection (GFoBa) procedure is designed to select the significant adjustment terms. Ph. D.