How to balance the bioinformatics data: pseudo-negative sampling

Abstract. Background: Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, data imbalance causes the performance of the minority clas...


Bibliographic Details
Main Authors: Yongqing Zhang, Shaojie Qiao, Rongzhao Lu, Nan Han, Dingxiang Liu, Jiliu Zhou
Format: Article
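Language: English
Published: BMC 2019-12-01
Series: BMC Bioinformatics
Subjects: Imbalanced data; Pseudo-negative sampling; Pearson correlation coefficients; Max-relevance; Min-redundancy
Online Access: https://doi.org/10.1186/s12859-019-3269-4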
id doaj-eb913d0d60354e29b6c7b60919723c06
record_format Article
spelling doaj-eb913d0d60354e29b6c7b60919723c06 2020-12-27T12:21:26Z eng BMC BMC Bioinformatics 1471-2105 2019-12-01 20 S25 1 13 10.1186/s12859-019-3269-4
How to balance the bioinformatics data: pseudo-negative sampling
Yongqing Zhang (School of Computer Science, Chengdu University of Information Technology)
Shaojie Qiao (School of Software Engineering, Chengdu University of Information Technology)
Rongzhao Lu (School of Computer Science, Chengdu University of Information Technology)
Nan Han (School of Management, Chengdu University of Information Technology)
Dingxiang Liu (School of Cybersecurity, Chengdu University of Information Technology)
Jiliu Zhou (School of Computer Science, Chengdu University of Information Technology)
Abstract
Background: Imbalanced datasets are commonly encountered in bioinformatics classification problems; that is, the number of negative samples is much larger than that of positive samples. In particular, data imbalance causes the performance of the minority class of positive samples to be underestimated. Therefore, how to balance bioinformatics data is a challenging problem.
Results: In this study, we propose a new data sampling approach, called pseudo-negative sampling, which can be applied effectively when negative samples greatly outnumber positive samples. Specifically, we design a supervised learning method based on a max-relevance min-redundancy criterion beyond the Pearson correlation coefficient (MMPCC), which is used to choose pseudo-negative samples from the negative samples and treat them as positive samples. In addition, MMPCC uses an incremental search technique to select optimal pseudo-negative samples, which reduces the computation cost. Consequently, the discovered pseudo-negative samples have strong relevance to positive samples and less redundancy with negative ones.
Conclusions: To validate the performance of our method, we conduct experiments based on four UCI datasets and three real bioinformatics datasets. According to the experimental results, the performance of MMPCC is better than that of other sampling methods in terms of Sensitivity, Specificity, Accuracy and the Matthews correlation coefficient. This reveals that the pseudo-negative samples are particularly helpful in solving the imbalanced dataset problem. Moreover, the gain in Sensitivity for the minority samples with pseudo-negative samples grows with the improvement of prediction accuracy on all datasets.
https://doi.org/10.1186/s12859-019-3269-4
Imbalanced data
Pseudo-negative sampling
Pearson correlation coefficients
Max-relevance
Min-redundancy
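Note on the evaluation measures: the abstract in this record compares methods by Sensitivity, Specificity, Accuracy and the Matthews correlation coefficient. The sketch below uses the standard textbook definitions of these four measures from confusion-matrix counts. It is not taken from the article; the function name `confusion_metrics` and the example counts are hypothetical, chosen only to illustrate why accuracy alone can mislead on imbalanced data.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Standard measures from confusion-matrix counts (hypothetical helper)."""
    total = tp + fp + tn + fn
    # Sensitivity: fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that were recovered.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient; set to 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Assumed example: 10 positives, 990 negatives, classifier predicts "negative" for all.
all_negative = confusion_metrics(tp=0, fp=0, tn=990, fn=10)
print(all_negative)
```

In this assumed example the accuracy is high even though no positive sample is found, so Sensitivity and the Matthews correlation coefficient are the measures that expose the failure on the minority class.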
collection DOAJ
language English
format Article
sources DOAJ
author Yongqing Zhang
Shaojie Qiao
Rongzhao Lu
Nan Han
Dingxiang Liu
Jiliu Zhou
title How to balance the bioinformatics data: pseudo-negative sampling
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2019-12-01
topic Imbalanced data
Pseudo-negative sampling
Pearson correlation coefficients
Max-relevance
Min-redundancy
url https://doi.org/10.1186/s12859-019-3269-4