SPA: a probabilistic algorithm for spliced alignment.

Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it...

Bibliographic Details
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2006-04-01
Series:	PLoS Genetics
Online Access:	http://dx.doi.org/10.1371/journal.pgen.0020024

id	doaj-4b172e9ad97a409192944851557ac024
record_format	Article
spelling	doaj-4b172e9ad97a409192944851557ac024 2020-11-24T22:39:54Z eng Public Library of Science (PLoS) PLoS Genetics 1553-7390 1553-7404 2006-04-01 2 4 e24 SPA: a probabilistic algorithm for spliced alignment. http://dx.doi.org/10.1371/journal.pgen.0020024
collection	DOAJ
language	English
format	Article
sources	DOAJ
title	SPA: a probabilistic algorithm for spliced alignment.
spellingShingle	SPA: a probabilistic algorithm for spliced alignment. PLoS Genetics
title_short	SPA: a probabilistic algorithm for spliced alignment.
title_full	SPA: a probabilistic algorithm for spliced alignment.
title_fullStr	SPA: a probabilistic algorithm for spliced alignment.
title_full_unstemmed	SPA: a probabilistic algorithm for spliced alignment.
title_sort	spa: a probabilistic algorithm for spliced alignment.
publisher	Public Library of Science (PLoS)
series	PLoS Genetics
issn	1553-7390 1553-7404
publishDate	2006-04-01
description	Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it is essential that cDNAs are very accurately mapped to their respective genomes. Currently available algorithms for cDNA-to-genome alignment do not reach the necessary level of accuracy because they use ad hoc scoring models that cannot correctly trade off the likelihoods of various sequencing errors against the probabilities of different gene structures. Here we develop a Bayesian probabilistic approach to cDNA-to-genome alignment. Gene structures are assigned prior probabilities based on the lengths of their introns and exons, and based on the sequences at their splice boundaries. A likelihood model for sequencing errors takes into account the rates at which misincorporation, as well as insertions and deletions of different lengths, occurs during sequencing. The parameters of both the prior and likelihood model can be automatically estimated from a set of cDNAs, thus enabling our method to adapt itself to different organisms and experimental procedures. We implemented our method in a fast cDNA-to-genome alignment program, SPA, and applied it to the FANTOM3 dataset of over 100,000 full-length mouse cDNAs and a dataset of over 20,000 full-length human cDNAs. Comparison with the results of four other mapping programs shows that SPA produces alignments of significantly higher quality. In particular, the quality of the SPA alignments near splice boundaries and SPA's mapping of the 5' and 3' ends of the cDNAs are highly improved, allowing for more accurate identification of transcript starts and ends, and accurate identification of subtle splice variations. Finally, our splice boundary analysis on the human dataset suggests the existence of a novel non-canonical splice site that we also find in the mouse dataset. The SPA software package is available at http://www.biozentrum.unibas.ch/personal/nimwegen/cgi-bin/spa.cgi.
url	http://dx.doi.org/10.1371/journal.pgen.0020024
_version_	1725706962340937728

Similar Items