Summary: | Since the advent of complete genome sequencing, vast amounts of nucleotide and amino acid sequence data have been produced. These data need to be effectively analysed and verified so that they may be used for biological discovery. A significant proportion of predicted protein sequences from these complete genomes have poorly characterised or unknown functional annotations. This thesis describes a number of approaches to the computational analysis of amino acid sequences for the prediction and analysis of protein function within complete genomes. The first chapter is a short introduction to computational genome analysis, while the second and third chapters describe how groups of related protein sequences (termed <i>protein families</i>) may be characterised using sequence clustering algorithms. Two novel and complementary sequence clustering algorithms are presented, together with details of how protein family information can be used to detect and describe the functions of proteins in complete genomes. Further research is described which uses this protein family information to analyse the molecular evolution of proteins within complete genomes. Recent developments in genome analysis have shown that the computational prediction of protein function in complete genomes is not limited to pure sequence homology methods. So-called <i>genome context</i> methods use other information from complete genome sequences to predict protein function. Examples include gene location or neighbourhood analysis and the analysis of the phyletic distribution of protein sequences. Chapter four describes a novel method for predicting whether two proteins are functionally associated or physically interact. This method is based on the detection of gene-fusion events within complete genomes. During this research many novel tools and methods for genome analysis, data-visualisation, data-mining and high-performance biological computing were developed.
Many of these tools represent interesting research projects in their own right, and formed the basis for the research carried out in this thesis. The final chapter describes a selection of the novel algorithms developed during this research.
|