Summary: | The biological functions of a protein within the cell are governed by its interactions with other proteins. While protein interaction data have recently become widely available for many organisms, they have not yet been fully explored with regard to the insights into protein characteristics they might provide. In this dissertation, I not only use protein interactions to explore protein characteristics, e.g., structure and function, but also use these characteristics to predict protein interactions. An exploratory analysis of protein interactions as networks reveals local clustering, including a tendency of proteins to form triangles. This observation prompted me to develop statistical approaches that go beyond pairwise interactions and that take into account the relationship between protein characteristics and protein interactions. My methods consider both pairwise and triple-wise interactions. Prior information from other organisms is also incorporated, and diverse biological characteristics can be integrated simultaneously. In the task of predicting protein characteristics, my pair-based score is more accurate than the popular Majority Vote method for large networks. Surprisingly, however, my triple-based score does not outperform the simpler pair-based method. This may be due to poor data quality and/or other unknown biological factors. In the task of predicting protein interactions, I demonstrate that including network structure in the form of triples significantly improves results over three other standard interaction predictors as well as over a pair-based version of the method. It also achieves greater coverage. It therefore appears that different tasks may require different models. A global model for protein interaction networks based on triples is outperformed by a pairwise method when it comes to predicting protein characteristics. Conversely, a triple-based method outperforms a model based on pairwise interactions when it comes to predicting interactions.
My methods offer three main improvements over current approaches. Firstly, they consider network structures of pairs and triples in the prediction. Secondly, in all my methods multiple protein characteristics can be considered simultaneously, which greatly improves the prediction. Thirdly, data from multiple species can be easily integrated. Interestingly, on this last point, the results suggest that there may be fundamental differences between the networks of eukaryotes and prokaryotes.
|