EMQIT: a machine learning approach for energy based PWM matrix quality improvement

Abstract Background: Transcription factor binding affinities to DNA play a key role in gene regulation. Learning the specificity of the mechanisms by which TFs bind to DNA is important to both experimentalists and theoreticians. With the development of high-throughput methods such as ChiP-se...


Bibliographic Details
Main Authors: Karolina Smolinska, Marcin Pacholczyk
Format: Article
Language: English
Published: BMC 2017-08-01
Series: Biology Direct
Subjects: PWM matrix, TRANSFAC, Jaspar, TFBS, 3DTF
Online Access: http://link.springer.com/article/10.1186/s13062-017-0189-y
Affiliations: Institute of Automatic Control, Silesian University of Technology (both authors)
DOI: 10.1186/s13062-017-0189-y
issn 1745-6150
description Abstract
Background: Transcription factor (TF) binding affinities to DNA play a key role in gene regulation. Learning the specificity of the mechanisms by which TFs bind to DNA is important to both experimentalists and theoreticians. With the development of high-throughput methods such as ChIP-seq, the need for unbiased models of binding events has become apparent. We present EMQIT, a modification of the approach introduced by Alamanova et al. and later implemented as the 3DTF server. We observed that tuning the Boltzmann factor weights, which are used to convert calculated energies to nucleotide probabilities, has a significant impact on the quality of the associated PWM matrix.
Results: Consequently, we proposed using receiver operating characteristic (ROC) curves and 10-fold cross-validation to learn the best weights from experimentally verified data in the TRANSFAC database. We applied our method to data available for various TFs. Using experimental data from the TRANSFAC database, we verified how efficiently the 3DTF matrices improved with our technique detect TF binding sites. The comparison showed significant similarity and comparable performance between the improved matrices and the experimental (TRANSFAC) matrices. The improved 3DTF matrices achieved significantly higher AUC values than the original 3DTF matrices (by at least 0.1) and, at the same time, detected notably more experimentally verified TFBSs.
Conclusions: The resulting improved PWM matrices for the analyzed factors are similar to the TRANSFAC matrices and have comparable predictive capabilities. Moreover, the improved PWMs achieve better results than the matrices downloaded from the 3DTF server. The presented approach is general and applicable to any energy-based matrices. EMQIT is available online at http://biosolvers.polsl.pl:3838/emqit .
Reviewers: This article was reviewed by Oliviero Carugo, Marek Kimmel and István Simon.
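The abstract describes converting calculated energies to nucleotide probabilities through Boltzmann factors and choosing the weights by ROC analysis with 10-fold cross-validation. The following is a minimal Python sketch of that idea, not the authors' implementation. It rests on assumptions that this record does not state: a single scalar weight `beta` applied to a 4 x L energy matrix, the standard Boltzmann form exp(-beta * E) normalised per column, a log-odds score against a uniform background, and a plain grid search over candidate weights. All function names are hypothetical.

```python
import numpy as np

BASES = "ACGT"  # assumed row order of the energy matrix


def energies_to_pwm(energy, beta):
    """Turn a 4 x L energy matrix into per-column nucleotide probabilities."""
    energy = np.asarray(energy, dtype=float)
    # Shifting by the column minimum avoids overflow and cancels on normalisation.
    shifted = energy - energy.min(axis=0, keepdims=True)
    weights = np.exp(-beta * shifted)
    return weights / weights.sum(axis=0, keepdims=True)


def log_odds_score(pwm, site, background=0.25, pseudo=1e-9):
    """Sum of per-position log-odds for one site as long as the PWM."""
    rows = [BASES.index(base) for base in site]
    cols = np.arange(len(site))
    return float(np.sum(np.log((pwm[rows, cols] + pseudo) / background)))


def auc(pos_scores, neg_scores):
    """Area under the ROC curve as P(pos > neg), ties counted as one half."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


def best_beta(energy, positives, negatives, betas):
    """Grid search: return the candidate weight with the highest AUC."""
    def auc_for(beta):
        pwm = energies_to_pwm(energy, beta)
        return auc([log_odds_score(pwm, s) for s in positives],
                   [log_odds_score(pwm, s) for s in negatives])
    return max(betas, key=auc_for)


def cross_validated_auc(energy, positives, negatives, betas, k=10, seed=0):
    """Mean held-out AUC; the weight is re-chosen on the training folds.

    Needs at least k positive and k negative sequences.
    """
    rng = np.random.default_rng(seed)
    pos_folds = np.array_split(rng.permutation(len(positives)), k)
    neg_folds = np.array_split(rng.permutation(len(negatives)), k)
    held_out = []
    for pos_idx, neg_idx in zip(pos_folds, neg_folds):
        p_test, n_test = set(pos_idx.tolist()), set(neg_idx.tolist())
        train_pos = [s for i, s in enumerate(positives) if i not in p_test]
        train_neg = [s for i, s in enumerate(negatives) if i not in n_test]
        beta = best_beta(energy, train_pos, train_neg, betas)
        pwm = energies_to_pwm(energy, beta)
        held_out.append(auc(
            [log_odds_score(pwm, positives[i]) for i in pos_idx],
            [log_odds_score(pwm, negatives[i]) for i in neg_idx]))
    return float(np.mean(held_out))
```

In this sketch `positives` would hold experimentally verified binding sites and `negatives` background sequences of the same length. With `beta = 0` every column is uniform and the matrix carries no information, while large values push each column toward its lowest-energy base, which is why the choice of weight can change the AUC.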
topic PWM matrix
TRANSFAC
Jaspar
TFBS
3DTF