Award details

New Developments of Large-scale Automatic Protein Function Prediction using Graphical Learning Techniques

Reference BB/L020505/1
Principal Investigator / Supervisor Professor David Jones
Co-Investigators / Co-Supervisors Dr Domenico Cozzetto
Institution University College London
Department Computer Science
Funding type Research
Value (£) 310,513
Status Completed
Type Research Grant
Start date 04/09/2014
End date 03/09/2017
Duration 36 months

Abstract

Surveys of public resources show that functional information is still completely missing for a considerable fraction of known proteins and is clearly incomplete for an even larger portion. Moreover, these estimates do not include metagenomics sequences, which pose even tougher challenges to existing functional annotation tools. Bioinformatics methods have long made use of very diverse data sources, alone or in combination, to predict protein function, with the understanding that different data types help elucidate complementary biological roles. Recently, community-wide initiatives have been launched to critically test existing approaches, to identify successful strategies and to highlight bottlenecks that hamper progress. The first CAFA (Critical Assessment of Functional Annotations) experiment found that: (i) the most reliable predictions are based on extensive use of sequence similarities that are often combined with high-throughput data sources; and (ii) there is room for improvement in prediction accuracy and in the deployment of fast, fully automated predictors. Here we propose to research ways to improve the integrative function prediction system that we tested at CAFA and that was ranked at or near the top across a range of benchmarks and evaluation metrics. In particular, we aim at: (i) making better use of existing sources of information, by studying how informative each data source can be relative to a functional category; (ii) adding gene expression profiles and protein-protein interaction network data to make more confident biological process assignments; (iii) exploring new ways of combining component predictions into a single unified probabilistic framework, employing graphical machine learning approaches; and (iv) delivering biologist-friendly Web tools to allow our work to be exploited by scientists working across the whole BBSRC remit.

Summary

A large fraction of the cellular activities required for life are carried out by proteins, some of which have been extensively studied over the years. Knowing exactly what these molecules do, when, where and how has been instrumental for medical and biotechnological use. Unfortunately, the level of detail required for such advanced applications is available only for a tiny fraction of the proteins in a typical cell; for many of them we have some reasonable clues about their biological roles. Moreover, there is also a substantial portion that we can barely link to our understanding of biology, even though we are confident that they exist. In human cells, for instance, these represent approximately 40% of the proteins. It is clearly very challenging to experimentally test all the proteins in order to describe their function at the finest level of detail. Computer programs can help narrow down the number of assays to run by leveraging known experimental data and the fact that some protein features can be used to recognise some well-studied functional units. The underpinning algorithms have become more and more advanced over time, but a number of independent studies have shown that there is still a lot of room for improvement in this field. One clear bottleneck that hampers progress is that all current methods address the questions of what proteins do and in which context separately. However, there is clear evidence that proteins carry out molecular activities in specific cellular compartments and in concert with other biological partners. The proposed project builds on successful previous work on protein function prediction to expand the scope and accuracy of our tools.
These already make use of a wide array of heterogeneous experimental data stored in public databases, which can give information about the protein of interest in terms of its evolutionary relationships to other characterized proteins, as well as the other proteins it physically interacts with or is co-regulated with, for instance. These diverse sources of information are then combined through some of the most popular machine-learning methods, which have been successfully applied in many other areas such as game-playing, speech recognition and e-mail spam filtering. Here we seek to make better use of the information already included in our system, to introduce additional biological data types, and to explore new and smarter ways of combining them. We will exploit our expertise in providing reliable and user-friendly online tools for protein structure and function prediction so that the new programs and predictions can be easily used and analyzed by experimentalists for their own research with just a PC and a standard web browser.

Impact Summary

The immediate beneficiaries of this research are the broad community of bench biologists needing additional functional clues for proteins of interest. Both academic and industry scientists will benefit in a similar way as the results of this research will be available freely to all users. Commercial scientists with sensitive data will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the function of the 40% of functionally uncharacterised proteins in model organism genomes can have significant impact in a broad variety of areas e.g. drug, antibody and vaccine design, biochemical engineering, protein design and even nanotechnology. Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and the products of these genes do and how the proteins interact can have wider implications in understanding the working of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics X – not assigned to a current Research Topic
Research Priority X – Research Priority information not available
Research Initiative X - not in an Initiative
Funding Scheme X – not Funded via a specific Funding Scheme