Award details

Developing a novel web-based tool for functional annotation of proteins

Reference BB/I023992/1
Principal Investigator / Supervisor Professor David Jones
Co-Investigators / Co-Supervisors
Institution University College London
Department Computer Science
Funding type Research
Value (£) 119,784
Status Completed
Type Research Grant
Start date 04/10/2011
End date 03/03/2013
Duration 17 months

Abstract

Since the 1980s, high-throughput sequencing technologies have produced over 100 billion base pairs of DNA sequence, cataloguing the genetic material of more than 1000 organisms. Genome sequences not only provide information on the complete set of genes and their precise locations in the chromosome, but also help to define the core proteome, i.e. the set of functional proteins that are the workhorse components of living cells. In this post-sequencing era, a detailed characterisation of a protein, covering its structural form, functional role and interactions with other molecules, is the next key step in driving our understanding of cellular processes and of responses to external stimuli or changes to the organism's environment. Ultimately this could also lead to new understanding of biological systems and related disease mechanisms. We propose here to build a web-based tool around our existing collection of publicly available data relevant to predicting protein function, and to apply state-of-the-art machine learning techniques to the integration of these data. We will thereby extract novel functional annotations for a number of model organism genomes (including human, mouse and yeast). The data sources we use include protein sequence features, genome-wide domain-based evolutionary information (e.g. domain fusions), publicly available transcriptomic data, microRNA and other regulatory binding sites, and both experimental and predicted protein-protein interaction maps. This analysis will rely on supercomputing facilities available at UCL in the form of the recently deployed Legion supercomputer, which will be further upgraded in 2010. Finally, we propose to build a novel user-driven protein classification tool, which will allow any biologist to compile his or her own protein classifier with no expertise in machine learning.

Summary

Every living cell within an organism contains thousands of different protein molecules. Although we know the biological function or role of most of these proteins, 40% or so of the proteins in a typical human cell, for example, have no known function, although we are fairly certain that they do indeed have one. Some people refer to this set of unknown genes as the 'dark matter' within the genome, i.e. we know the genes are present but we simply do not know why they are there. Function can be described in many different ways, and given that the interactions of groups of proteins are perhaps the most important types of processes occurring within cells, describing the function of a protein by the interactions it makes with other proteins (i.e. in the form of networks) is clearly a good approach. Knowing how these protein molecules correctly bind together, or interact, can even help us to understand when something goes wrong within the machinery of a cell, for example during aging processes. This in turn will allow us to better understand disease or aging and will also perhaps help us to develop medicines to correct the faulty machinery. By taking data from many different experimental sources, all from publicly available databases, computers can help us to successfully predict the function of a protein based on, for example, determining what other proteins it interacts with and which genes are switched on in synchrony with the protein's own encoding gene. We can also look at component features of a protein, such as which parts might be embedded in the membranes surrounding the cell or what kind of overall shape the protein might have. This project seeks to develop new computer programs to analyse these sorts of data and thus help biologists deduce the functions of the many genes, and the proteins they encode, whose functions are currently unknown.
These programs and predictions will be made available via the World Wide Web, so that biologists can easily make use of our results for their own research work with just a PC and a standard web browser. This should greatly help with research into how cells work and how they might go wrong during disease or aging. Ultimately, such discoveries might even lead to the development of new drugs and treatments or even new industrial processes for synthesising useful chemicals.

Impact Summary

The immediate beneficiaries of this research are the broad community of bench biologists needing additional functional clues for proteins of interest. Both academic and industry scientists will benefit in a similar way, as the results of this research will be freely available to all users. Commercial scientists with sensitive data will be able to license the software through UCL Business so that they can exploit the resource without revealing their research interests to other users. Being able to determine even some clue as to the function of the 40% of functionally uncharacterised proteins in model organism genomes can have a significant impact in a broad variety of areas, e.g. drug, antibody and vaccine design, biochemical engineering, protein design and even nanotechnology. Beyond industrial applications of this research, filling in the major gaps in our knowledge of what the full complement of genes and their products do, and how the proteins interact, can have wider implications for understanding the working of healthy cells and how they age. Ultimately this work can make a contribution to our overall understanding of how life processes arise from interactions between a relatively small number of genes in our genomes and the genomes of other organisms.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority Technology Development for the Biosciences
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme