Summary: | Automated biological sequence annotation is a rapidly developing field. Computational tools that facilitate annotation are urgently needed, given the rate at which new genetic sequences are accumulating. This dissertation introduces new approaches to two fundamental problems in contemporary bioinformatics, and their synergistic integration. They are described in two parts. In Part I, we introduce a novel template-based approach for differentiating between transcription factor binding sites and non-binding sites. Templates model three key discriminatory features: sequence homology, structural homology, and nucleotide polymorphisms, which are present to varying degrees in different transcription factor binding sites. We show how templates can be adopted to predict the actual binding affinity of a given binding site based on the distribution of binding affinities of a set of training sites. We also present examples demonstrating the excellent discriminative and predictive capabilities of templates for transcription factor binding sites. In Part II, we introduce a new framework for sequence alignment. Here, information is treated as information units that act upon the sequences being aligned, rather than as an intrinsic part of the sequences themselves. The result is a versatile alignment tool that can dynamically incorporate knowledge into a sequence alignment on demand. We describe data structures that form an integral part of such an alignment tool and that are efficient in terms of both storage and retrieval of information. We illustrate a hybrid alignment strategy geared towards accommodating the diversity of information available on the sequences being aligned. The alignment algorithms described are optimised over a combination of both the alignment of individual residue pairs and the alignment of sequence segments. 
We present examples demonstrating the versatility of the described alignment framework and the high quality of alignments that it produces.
|