SimFuse: A Novel Fusion Simulator for RNA Sequencing (RNA-Seq) Data

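The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.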

Bibliographic Details
Main Authors: Yuxiang Tan, Yann Tambouret, Stefano Monti
Format: Article
Language: English
Published: Hindawi Limited 2015-01-01
Series: BioMed Research International
Online Access: http://dx.doi.org/10.1155/2015/780519
id doaj-396edfd178c24e8793e623ff4943a888
record_format Article
spelling doaj-396edfd178c24e8793e623ff4943a888 | 2020-11-24T21:39:48Z | eng | Hindawi Limited | BioMed Research International | 2314-6133 | 2314-6141 | 2015-01-01 | 2015 | 10.1155/2015/780519 | 780519
Yuxiang Tan (0), Yann Tambouret (1), Stefano Monti (2)
(0) Bioinformatics, Boston University, Boston, MA 02215, USA
(1) Research Computing Services (IS&T), Boston University, Boston, MA 02215, USA
(2) Bioinformatics, Boston University, Boston, MA 02215, USA
collection DOAJ
language English
format Article
sources DOAJ
author Yuxiang Tan
Yann Tambouret
Stefano Monti
publisher Hindawi Limited
series BioMed Research International
issn 2314-6133
2314-6141
publishDate 2015-01-01
description The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.
url http://dx.doi.org/10.1155/2015/780519
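
The description above rests on one idea: when the fusions in a dataset are simulated, the true set is known, so precision and recall of a detector can be computed directly. A minimal sketch of that calculation follows. It is not SimFuse's code or interface; the function name, the gene-pair representation, and the placeholder gene names are all assumptions made for illustration.

```python
def precision_recall(simulated, detected):
    """Score detected fusion calls against the known simulated fusions.

    Both arguments are iterables of (gene_5prime, gene_3prime) tuples.
    This pair representation is an assumption for illustration only.
    """
    truth = set(simulated)
    calls = set(detected)
    true_pos = len(truth & calls)  # calls that match a simulated fusion
    precision = true_pos / len(calls) if calls else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: four simulated fusions, three calls, two of them correct.
simulated = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"),
             ("GENE_E", "GENE_F"), ("GENE_G", "GENE_H")]
detected = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_X", "GENE_Y")]

precision, recall = precision_recall(simulated, detected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same function once per simulated dataset, one for each number of fusion supporting reads, would give the supporting-read-specific view of performance that the description mentions.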