Award details

Development of a graph-theoretic approach to predict protein function by integrating large scale heterogeneous data

Reference BB/F00964X/1
Principal Investigator / Supervisor Professor Alberto Paccanaro
Co-Investigators / Co-Supervisors Professor Laszlo Bogre
Institution Royal Holloway, Univ of London
Department Computer Science
Funding type Research
Value (£) 419,814
Status Completed
Type Research Grant
Start date 01/10/2008
End date 29/02/2012
Duration 41 months

Abstract

Statistically sound large-scale protein function prediction can be obtained only by integrating evidence from different sources. Functional inference methods that exploit biological network topologies offer good performance. So far, however, such methods are limited in the type of data they can integrate, while methods that can integrate a greater variety of data do not take advantage of the networks' topologies. I propose a general method that can integrate essentially any data type available, taking into account the intrinsic structure of each data type: it uses graph-theoretic methods to produce functional evidence from network data, and it integrates this evidence with one-dimensional information using machine learning techniques. Defining function in terms of the Gene Ontology, I shall collect datasets for S. cerevisiae, C. elegans, D. melanogaster, A. thaliana and H. sapiens. Algorithm development and testing will be done on S. cerevisiae. I shall then verify how these methods transfer to the other organisms. Performance on these organisms will be evaluated 'in silico', by means of test sets. The approach will also be tested 'in vivo' by predicting the Biological Process for a group of MAP kinases that belong to the signalling pathways of A. thaliana. These predictions will be tested through functional assays: (1) an RNAi screen and quantitative measurements of MAPK signalling outputs, MAPK activities and promoter activations in cultured Arabidopsis cells; and (2) quantitative phenotypic tests for selected phenotypes in cell differentiation (e.g. stomata development) and stress responses. I shall design and implement stand-alone and web-based software tools incorporating the algorithms developed. These will enable the biologist to easily apply the algorithms through a user-friendly interface; visualization tools will make the functional inference process transparent to the user. All these tools will be made freely available to the scientific community.

Summary

The list of organisms with a completed genome sequence is continuously growing, and this has led to the identification of thousands of genes whose function is still unknown. These genes could potentially be involved in important biological cell functions and could represent important targets for diagnostic and pharmacogenomics studies and be of industrial and agronomical importance. A major undertaking for biology is therefore that of identifying the function of these uncharacterized genes on a genomic scale. The challenge for bioinformatics is then to devise algorithmic methods that, given a gene, can propose a hypothesis for its function that can then be validated by wet-lab assays. Luckily, new experimental techniques have become available, producing data which offer clues about protein function and can therefore be employed for function prediction, e.g. protein interaction data and gene expression data. Some experimental and computational data have a natural representation as networks (e.g. protein interaction data), while others are inherently 'one-dimensional' (e.g. sequence patterns). Three facts have recently become clear: while each data type contains important information that can help in determining the function of a protein, no single data type by itself suffices; large-scale functional inference improves greatly when evidence from different sources is integrated; for those data types which can be represented as networks, the best results are obtained by algorithms that take advantage of the networks' topologies. So far, methods that make functional inferences on networks are very limited in the type of data they can integrate, while methods that can integrate a greater variety of data do not take advantage of the networks' topologies.
I intend to investigate a general method that can integrate essentially any data type currently available, taking into account its intrinsic structure: it takes advantage of the graph topology for network data, and it can integrate this evidence together with one-dimensional information. I shall develop graph-theoretical methods that use the diffusion of information over graphs to generate functional evidence from network data. This evidence is then combined with other one-dimensional information using machine learning techniques. The strength of the methodology lies in its ability to use diverse sets of noisy data, and to combine them to obtain sound statistical inferences; the weak signals contained in each dataset are enhanced by integrating the data. The methodology will be first developed on yeast, and I shall then transfer this approach to higher organisms such as C. elegans, D. melanogaster, A. thaliana, and H. sapiens. For all these organisms the performance of the algorithms will then be evaluated 'in silico' by means of test sets; that is, I shall verify the accuracy of the methods at predicting the function for genes whose annotation is known. The approach will then be tested 'in vivo' on a sub-network of genes that form signalling pathways (MAPK signalling) and function to transmit information from receptors to gene expression. MAPK pathway components are highly diversified in the model plant, Arabidopsis thaliana, with 123 components. For many of these we do not know how they connect up and what their biological functions are. These will be predicted by the algorithms and then functionally tested by silencing their expression using RNA interference and in mutant lines. I shall also design and implement stand-alone and web-based software tools incorporating the algorithms developed.
The applications will enable the biologist to apply the algorithms easily through a user-friendly interface and to visualize the relevant biological networks, thus making the inference process transparent and providing an explanation for the functional annotation predicted by the system. A web tool will also be created. All these tools will be made freely available to the scientific community.
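The summary mentions using the diffusion of information over graphs to generate functional evidence from network data, but does not specify an algorithm. As a purely illustrative sketch, the Python below uses a random walk with restart, which is one common diffusion scheme and not necessarily the method developed under this award. The function name, the parameters and the toy protein identifiers (P1 to P5) are all hypothetical.

```python
def diffuse_scores(adjacency, seed_scores, restart=0.5, iterations=50):
    """Random walk with restart on an undirected graph (illustrative only).

    adjacency:   dict mapping each node to the set of its neighbours
    seed_scores: dict mapping nodes with a known annotation to a starting score
    restart:     probability of jumping back to the seed nodes at each step
    """
    nodes = list(adjacency)
    seeds = {n: seed_scores.get(n, 0.0) for n in nodes}
    scores = dict(seeds)
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbour passes on its score, split equally among its own neighbours.
            incoming = sum(scores[nb] / len(adjacency[nb]) for nb in adjacency[node])
            updated[node] = restart * seeds[node] + (1.0 - restart) * incoming
        scores = updated
    return scores


# Hypothetical toy interaction network: a chain P1 - P2 - P3 - P4 - P5.
toy_network = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2", "P4"},
    "P4": {"P3", "P5"},
    "P5": {"P4"},
}

# Assume only P1 is annotated with the function of interest.
toy_scores = diffuse_scores(toy_network, {"P1": 1.0})

for protein in sorted(toy_scores):
    print(protein, round(toy_scores[protein], 4))
```

In this toy chain the score falls off with distance from the annotated protein, so proteins closer to P1 in the network receive stronger evidence for sharing its function. In the approach described above, such network-derived scores would be one input among several to a subsequent machine learning step.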
Committee Closed Committee - Engineering & Biological Systems (EBS)
Research Topics Plant Science, Systems Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme