Identification of lung cancer gene markers through kernel maximum mean discrepancy and information entropy

Abstract Background: The early diagnosis of lung cancer has long been a critical problem in clinical practice, and identifying differentially expressed genes as disease markers is a promising solution. However, most existing gene differential expression analysis (DEA) methods have two mai...

Bibliographic Details
Main Authors: Zhixun Zhao, Hui Peng, Xiaocai Zhang, Yi Zheng, Fang Chen, Liang Fang, Jinyan Li
Format: Article
Language: English
Published: BMC 2019-12-01
Series: BMC Medical Genomics
Subjects: Lung cancer; Maximum mean discrepancy; Information theory; Biomarker discovery
Online Access: https://doi.org/10.1186/s12920-019-0630-4
id doaj-d48a875d17994d18a1561723ea29972b
record_format Article
spelling doaj-d48a875d17994d18a1561723ea29972b; 2021-04-02T20:48:33Z; eng; BMC; BMC Medical Genomics; 1755-8794; 2019-12-01; 12; S8; 1; 10; 10.1186/s12920-019-0630-4
Zhixun Zhao (0): Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney
Hui Peng (1): Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney
Xiaocai Zhang (2): Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney
Yi Zheng (3): Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney
Fang Chen (4): Faculty of Engineering and Information Technology, University of Technology Sydney
Liang Fang (5): School of Computer, National University of Defense Technology
Jinyan Li (6): Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney
collection DOAJ
sources DOAJ
issn 1755-8794
description Abstract
Background: The early diagnosis of lung cancer has long been a critical problem in clinical practice, and identifying differentially expressed genes as disease markers is a promising solution. However, most existing gene differential expression analysis (DEA) methods have two main drawbacks. First, they rely on fixed statistical hypotheses and are not always effective. Second, they cannot identify a definite expression-level boundary when there is no obvious expression-level gap between the control and experiment groups.
Methods: This paper proposes a novel approach to identify marker genes and gene expression-level boundaries for lung cancer. By calculating a kernel maximum mean discrepancy, the method evaluates the expression differences among normal, normal adjacent to tumor (NAT), and tumor samples. For the potential marker genes, the expression-level boundaries among the groups are defined with an information entropy method.
Results: Compared with two conventional methods, the t-test and fold change, the top average-ranked genes selected by the proposed method achieve better performance under all metrics in 10-fold cross-validation. GO and KEGG enrichment analyses are then conducted to explore the biological functions of the top 100 ranked genes. Finally, the top 10 average-ranked genes are chosen as lung cancer markers, and their expression boundaries are calculated and reported.
Conclusion: The proposed approach is effective for identifying gene markers for lung cancer diagnosis. It is not only more accurate than conventional DEA methods but also provides a reliable way to identify gene expression-level boundaries.
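The two ideas named in the abstract, a kernel maximum mean discrepancy score and an entropy-based expression boundary, can be illustrated with a minimal sketch. The abstract does not specify the kernel, its bandwidth, the estimator, or the exact entropy procedure, so everything below is an assumption chosen for illustration: a Gaussian kernel with a fixed `sigma`, the biased squared-MMD estimate, and a single cut point that minimises the weighted class entropy of the two sides. The function names and the toy values are hypothetical and are not taken from the paper.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian (RBF) kernel between two scalar expression values."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two 1-D samples."""
    m, n = len(xs), len(ys)
    k_xx = sum(rbf_kernel(a, b, sigma) for a in xs for b in xs) / (m * m)
    k_yy = sum(rbf_kernel(a, b, sigma) for a in ys for b in ys) / (n * n)
    k_xy = sum(rbf_kernel(a, b, sigma) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_boundary(values, labels):
    """Cut point minimising the weighted class entropy of the two sides.

    Assumed procedure (not confirmed by the abstract): try the midpoint
    between each pair of adjacent sorted values and keep the best one.
    """
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut


# Toy expression values for one hypothetical gene (not data from the paper).
normal = [1.0, 1.2, 0.9, 1.1]
tumor = [3.0, 3.3, 2.8, 3.1]
print(mmd2(normal, tumor))    # larger when the two groups differ
print(mmd2(normal, normal))   # identical samples give a zero score
print(entropy_boundary(normal + tumor, ["normal"] * 4 + ["tumor"] * 4))
```

Under these assumptions, genes could be ranked by their MMD score between groups, and the cut point returned for a top-ranked gene would play the role of its expression-level boundary. A three-group setting (normal, NAT, tumor) would need pairwise scores and more than one cut, which this sketch does not attempt.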
topic Lung cancer
Maximum mean discrepancy
Information theory
Biomarker discovery