Multiple Suboptimal Solutions for Prediction Rules in Gene Expression Data

This paper discusses mathematical and statistical aspects of analysis methods applied to microarray gene expression data. We focus on pattern recognition to extract informative features embedded in the data for the prediction of phenotypes. It has been pointed out that there are severely difficult problems...


Bibliographic Details
Main Authors: Osamu Komori, Mari Pritchard, Shinto Eguchi
Format: Article
Language: English
Published: Hindawi Limited 2013-01-01
Series: Computational and Mathematical Methods in Medicine
Online Access: http://dx.doi.org/10.1155/2013/798189
id doaj-66cdb1fb8ade48ecb07d90fc1c1a8046
record_format Article
spelling doaj-66cdb1fb8ade48ecb07d90fc1c1a8046
2020-11-24T21:30:00Z
eng
Hindawi Limited
Computational and Mathematical Methods in Medicine
1748-670X
1748-6718
2013-01-01
2013
10.1155/2013/798189
798189
Multiple Suboptimal Solutions for Prediction Rules in Gene Expression Data
Osamu Komori (0)
Mari Pritchard (1)
Shinto Eguchi (2)
(0) The Institute of Statistical Mathematics, Midori-cho, Tachikawa, Tokyo 190-8562, Japan
(1) CLC Bio Japan, Inc., Daikanyama Park Side Village 204, 9-8 Sarugakucho, Shibuya-ku, Tokyo 150-0033, Japan
(2) The Institute of Statistical Mathematics, Midori-cho, Tachikawa, Tokyo 190-8562, Japan
collection DOAJ
language English
format Article
sources DOAJ
author Osamu Komori
Mari Pritchard
Shinto Eguchi
title Multiple Suboptimal Solutions for Prediction Rules in Gene Expression Data
publisher Hindawi Limited
series Computational and Mathematical Methods in Medicine
issn 1748-670X
1748-6718
publishDate 2013-01-01
description This paper discusses mathematical and statistical aspects of analysis methods applied to microarray gene expression data. We focus on pattern recognition to extract informative features embedded in the data for the prediction of phenotypes. It has been pointed out that there are severely difficult problems due to the imbalance between the number of observed genes and the number of observed subjects. We reanalyze published microarray gene expression data and detect many other gene sets with almost the same performance. We conclude that, at the current stage, it is not possible to extract only the informative genes with high performance from among all the observed genes. We investigate why this difficulty persists even though analysis methods and learning algorithms are being actively proposed in statistical machine learning. We focus on the mutual coherence, or the absolute value of the Pearson correlation between two genes, and describe the distributions of this correlation for the selected set of genes and for the total set. We show that the problem of finding informative genes in high-dimensional data is ill-posed and that the difficulty is closely related to the mutual coherence.
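The quantity named in the description, the absolute Pearson correlation between pairs of genes, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy data, and the `selected` index list are hypothetical, and taking the maximum over all pairs as "the" mutual coherence follows the common compressed-sensing convention, which is an assumption here.

```python
import numpy as np

def abs_correlations(X):
    """Absolute Pearson correlation for every unordered pair of columns (genes).

    X has shape (n_subjects, n_genes); the result is a 1-D array with one
    entry per gene pair.
    """
    corr = np.corrcoef(X, rowvar=False)    # (n_genes, n_genes) correlation matrix
    iu = np.triu_indices_from(corr, k=1)   # entries strictly above the diagonal
    return np.abs(corr[iu])

def mutual_coherence(X):
    """Largest absolute pairwise correlation (assumed definition, see lead-in)."""
    return float(abs_correlations(X).max())

# Hypothetical toy data: 20 subjects, 200 genes of pure noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 200))
selected = [0, 1, 2, 3, 4]                 # hypothetical "selected" gene set

all_pairs = abs_correlations(X)            # distribution over the total set
sel_pairs = abs_correlations(X[:, selected])  # distribution over the selected set
print(all_pairs.size, sel_pairs.size, mutual_coherence(X))
```

With far more genes than subjects, even pure noise tends to produce some large absolute correlations, which is one way to see why many different gene sets can perform almost equally well.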
url http://dx.doi.org/10.1155/2013/798189