Summary: | International biological sequence databases hold information about protein and DNA molecules. The molecules are represented by sequences of characters. In the analysis of these data, algorithms for comparing the character sequences play a central role. Comparisons can be made using dynamic programming techniques to determine the score of optimal sequence alignments. Such methods are particularly popular with molecular biologists because they accommodate the kinds of differences which actually occur in the sequences of related molecules. Sequence alignments are normally scored using score tables based on an evolutionary model. The derivation of these score tables is re-examined, and a formula is found that gives an analytic counterpart to an empirical method for assessing a score table's discriminating power. Use of the formula to derive alternative protein similarity scoring tables is discussed. A new approach to tackling the heavy computational demands of the dynamic programming algorithm is described: intensive optimisation of a microcomputer implementation. This provides an alternative to implementations which use parallel computers for searching protein databases. This thesis also describes how other implementation problems were tackled in order to make more effective use of the serial comparison software. The new software permitted comparison by optimal alignment of 32,000,000 pairs of sequences from a protein database using widely available and inexpensive hardware. The results from this search were then reorganised to facilitate the finding of previously unseen similarities. Software tools were written to assist with the analysis, including software to align sequence families. From the results of this work, nine similarities are presented which do not appear to have been previously noted. The examples illustrate factors that are important in assessing similarities with scores close to the boundaries of significance.
The similarities presented are of particular interest because of the biological functions they relate. One software tool developed for the sequence analysis work was a new multiple sequence alignment editor and sequence aligner, 'medal'. Lessons from its use on real sequence data led to a modification of the original comparison method to accommodate local variations in sequence similarity. Consideration is given to parallelisation of this modification and of the methods used to obtain speed in the serial software. Alternatives are suggested. The suggested parallel method to cope with variations in sequence similarity requires two interdependent sequence comparisons. A serial program using three interdependent comparisons is demonstrated and shows the feasibility of multiple interdependent comparisons. Examples show how this new program, 'Fradho', can compare DNA sequences with protein sequences while accommodating frameshifts.
|