New algorithms in factor analysis : applications, model selection and findings in bioinformatics

Advancements in microelectronic devices and computational and storage technologies enable the collection of high volume, high speed and high dimension data in many applications. Due to the high dimensionality of these measurements, exact dependence of the observations on the various parameters or va...

Bibliographic Details
Main Authors: Wu, Ho-chun, 胡皓竣
Language: English
Published: The University of Hong Kong (Pokfulam, Hong Kong) 2014
Subjects: Factor analysis; Bioinformatics - Mathematical models
Online Access: http://hdl.handle.net/10722/205839
id ndltd-HKU-oai-hub.hku.hk-10722-205839
record_format oai_dc
collection NDLTD
language English
sources NDLTD
topic Factor analysis
Bioinformatics - Mathematical models
description Advancements in microelectronic devices and in computational and storage technologies enable the collection of high-volume, high-speed and high-dimensional data in many applications. Due to the high dimensionality of these measurements, the dependence of the observations on the various parameters or variables may not be exactly known. Factor analysis (FA) is a useful multivariate technique that exploits the redundancies among observations and reveals their dependence on a set of latent variables called factors. Major issues with conventional FA include its high arithmetic complexity for real-time online implementation, its assumption of static system parameters, the demand for interval forecasting, its lack of robustness against outlying observations, and model selection in problems with high dimension but a low number of samples (HDLS). This thesis addresses these issues and proposes new extensions to the existing FA algorithms. First, in order to reduce the arithmetic complexity, we propose new recursive FA (RFA) algorithms that recursively compute only the dominant principal components (PCs) and eigenvalues in the major subspace tracked by efficient subspace tracking algorithms. Specifically, two new approaches, namely rank-1 modification and deflation, are proposed for updating the PCs and eigenvalues in the classical fault detection problem, with different tradeoffs between accuracy and arithmetic complexity. They significantly reduce the online arithmetic complexity and allow adaptation to time-varying system parameters. Second, we extend the RFA algorithm to the forecasting of time series and propose a new recursive dynamic factor analysis (RDFA) algorithm for electricity price forecasting. While the PCs are recursively tracked by the subspace algorithm, a random walk or a state dynamical model can be incorporated to describe the latest state of the time-varying auto-regressive (AR) model built from the factors.
This formulation can be solved by the celebrated Kalman filter (KF), which in turn allows future values to be forecast with estimated confidence intervals. Third, we propose new robust covariance and outlier detection criteria, based on the concept of robust M-estimation, to improve the robustness of the proposed RFA and RDFA algorithms against outlying observations. Experimental results show that the proposed methods can effectively suppress the adverse contributions of outliers to the factors and PCs. Finally, in order to improve the consistency of model selection and facilitate the estimation of p-values in HDLS problems, we propose a new automatic model selection method based on ridge partial least squares and recursive feature elimination. Furthermore, a novel performance criterion is proposed for ranking variables according to how consistently they are chosen under different perturbations of the samples. Using this criterion, the associated p-values can be estimated under the HDLS setting. Experimental results using real cancer gene microarray datasets show that improved prognosis can be obtained by the proposed approach as compared with conventional techniques. Furthermore, to quantify their statistical significance, the p-values of the identified genes are estimated, and functional analysis of the significant genes found in the diffuse large B-cell lymphoma (DLBCL) gene microarray data is performed to validate the findings. While we focus on a few engineering problems, these algorithms are also applicable to other related applications. === published_or_final_version === Electrical and Electronic Engineering === Doctoral === Doctor of Philosophy
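The state-space idea in the description (a random walk on the coefficients of a time-varying AR model, solved by a Kalman filter that also yields forecast intervals) can be illustrated with a minimal sketch. This is not the thesis's RDFA algorithm: it works on a single scalar series rather than on tracked factors, and the function name `tvar_kalman`, the noise variances `q` and `r`, and the 1.96 multiplier for an approximate 95% interval are all assumptions made for illustration.

```python
import numpy as np

def tvar_kalman(y, order=2, q=1e-4, r=1.0):
    """Track time-varying AR(order) coefficients with a random-walk Kalman filter.

    Assumed model (illustrative, not taken from the thesis):
      state:        a_t = a_{t-1} + w_t,    w_t ~ N(0, q * I)
      observation:  y_t = h_t^T a_t + v_t,  v_t ~ N(0, r)
    with regressor h_t = [y_{t-1}, ..., y_{t-order}].
    Returns the final coefficients, their covariance, and a one-step-ahead
    forecast as (mean, lower, upper) with an approximate 95% interval.
    """
    y = np.asarray(y, dtype=float)
    a = np.zeros(order)              # coefficient estimate
    P = np.eye(order)                # coefficient covariance
    Q = q * np.eye(order)
    for t in range(order, len(y)):
        h = y[t - order:t][::-1]     # most recent sample first
        P = P + Q                    # prediction step of the random walk
        s = h @ P @ h + r            # innovation variance
        k = P @ h / s                # Kalman gain
        a = a + k * (y[t] - h @ a)   # correct with the prediction error
        P = P - np.outer(k, h) @ P
    h = y[-order:][::-1]             # regressor for the next, unseen sample
    mean = h @ a
    var = h @ (P + Q) @ h + r        # predictive variance of that sample
    half = 1.96 * np.sqrt(var)
    return a, P, (mean, mean - half, mean + half)
```

A larger `q` lets the coefficients adapt faster at the cost of noisier estimates; setting `q` close to zero reduces the filter to ordinary recursive least squares.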
author Wu, Ho-chun
胡皓竣
title New algorithms in factor analysis : applications, model selection and findings in bioinformatics
publisher The University of Hong Kong (Pokfulam, Hong Kong)
publishDate 2014
url http://hdl.handle.net/10722/205839
spelling ndltd-HKU-oai-hub.hku.hk-10722-205839 2015-07-29T04:02:42Z New algorithms in factor analysis : applications, model selection and findings in bioinformatics Wu, Ho-chun 胡皓竣 Factor analysis Bioinformatics - Mathematical models
published_or_final_version Electrical and Electronic Engineering Doctoral Doctor of Philosophy 2014-10-10T23:13:42Z 2014-10-10T23:13:42Z 2013 PG_Thesis 10.5353/th_b5153672 b5153672 http://hdl.handle.net/10722/205839 eng HKU Theses Online (HKUTO) Creative Commons: Attribution 3.0 Hong Kong License The author retains all proprietary rights, (such as patent rights) and the right to use in future works. The University of Hong Kong (Pokfulam, Hong Kong)