BBSRC Portfolio Analyser
Award details
A GPU-based high performance system for discovering consensus domain architecture and functional annotation of protein families
Reference
BB/K004131/1
Principal Investigator / Supervisor
Professor Alberto Paccanaro
Co-Investigators / Co-Supervisors
Institution
Royal Holloway, Univ of London
Department
Computer Science
Funding type
Research
Value (£)
114,627
Status
Completed
Type
Research Grant
Start date
01/07/2012
End date
31/12/2013
Duration
18 months
Abstract
Proteins can be grouped into families whose members are likely to perform similar functions. Identifying these protein families is important because it can provide clues to the function of proteins. Proteins are often composed of more than one domain (basic structural or functional units of evolution), and protein function depends on the mutual interplay between the distinct domains and the links between them. In other words, protein function depends on the domain architecture of the protein. The preliminary work for this project is GFam, a system that we have recently developed which groups proteins into families that share a common domain architecture. GFam has been applied to sets of about 30k proteins. The current proposal is aimed at creating tools for providing consensus signature architectures for very large sequence datasets. To do this we shall develop a new high performance implementation of the GFam pipeline, which will run in parallel on multi-processor servers and multiple GPGPUs. Moreover, we shall develop a web application that will display the architectures through a user-friendly web interface. The interface will allow users to retrieve proteins with the same architecture in either the same or different organisms. Importantly, it will also provide sets of functional annotation terms associated with each of the different consensus signature architectures. The system will be run periodically on all complete genome projects and will provide protein family architectures and their functional annotation for all the proteins in those genomes.
Summary
The list of organisms with a completed genome sequence is continuously growing, and this has led to the identification of thousands of genes whose function is still unknown. These genes could potentially be involved in important biological cell functions, could represent important targets for diagnostic and pharmacogenomics studies, and could be of industrial and agronomical importance. A major undertaking for biology is therefore that of identifying the function of these uncharacterized genes on a genomic scale. The challenge for bioinformatics is then to develop algorithms that, given a gene, can propose a hypothesis for its function. Comparisons of sequences from complete genomes have revealed that gene duplication, divergence and rearrangement are predominant mechanisms that drive the expansion of the set of proteins of a given organism during evolution. This means that proteins can be grouped into families whose members are likely to perform similar functions. The identification of these protein families is therefore central, as it can provide important clues to the function of proteins. Proteins are often composed of several domains. A domain is a segment of protein sequence that can evolve independently of the rest of the protein chain. Each domain forms a compact three-dimensional structure and can appear in a variety of different proteins. Protein function depends on the mutual interplay between the distinct domains and the links between them. In other words, protein function depends on the domain architecture of the protein. Therefore we would like to have a tool that can group proteins into families according to their architecture: all proteins with the same architecture should belong to the same group. The development of such a tool is exactly the goal of this project. Moreover, the tool that we plan here will also be able to suggest possible functional roles for the various architectures. Our tool is aimed at working on very large sets of proteins.
The amount of computation required for problems of this size is feasible only by taking advantage of the latest advances in graphics processing unit (GPU) technology. Modern GPUs are very efficient for graphics, but their highly parallel structure also makes them extremely effective for algorithms where large blocks of data are processed in parallel, even more effective than general-purpose CPUs. The use of GPU technology will allow us to create a web application that scientists will use to obtain the architectures for very large sets of proteins, together with possible functional roles for the various architectures. Importantly, we shall periodically run our system on the major genomes available, and we will thus be able to provide, through our web server, architectures and related annotation for all the proteins in those genomes. All these web services will be made freely available to the scientific community.
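The grouping rule described in this summary (all proteins with the same architecture belong to the same group) can be illustrated with a minimal Python sketch. This is not the GFam pipeline or its GPU implementation; it assumes that each protein already has an ordered list of domain labels, and the function name, protein names and domain labels are all hypothetical.

```python
from collections import defaultdict


def group_by_architecture(domain_assignments):
    """Group protein IDs by their ordered sequence of domain labels.

    domain_assignments maps a protein ID to a list of domain labels
    in sequence order (a hypothetical, pre-computed input).
    """
    families = defaultdict(list)
    for protein_id, domains in domain_assignments.items():
        # The architecture is the ordered tuple of domains; proteins
        # with an identical tuple fall into the same family.
        families[tuple(domains)].append(protein_id)
    return dict(families)


# Toy input with made-up protein and domain names (illustrative only).
assignments = {
    "protA": ["D1", "D2"],
    "protB": ["D1", "D2"],
    "protC": ["D1"],
    "protD": ["D2", "D1"],
}

families = group_by_architecture(assignments)
for architecture, members in families.items():
    print(" > ".join(architecture), "->", ", ".join(members))
```

In this sketch the order of domains matters, so protD (D2 followed by D1) is kept separate from protA and protB. Whether order should matter, and how consensus architectures are derived from raw domain predictions, are design choices the summary does not specify.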
Impact Summary
This project will benefit biologists interested in protein function, both experimental and computational scientists, from academia and industry. The results will impact any biologist interested in understanding how organisms and life processes arise through natural selection mechanisms acting on the protein repertoire encoded by the genome of the organism. Our method will make a serious impact towards understanding the large proportion of uncharacterized genes and proteins: genome sequencing efforts have left us with near-complete knowledge of hundreds of full genomic sequences, but without a comparably exhaustive inventory of what all these genes and proteins do, and in many cases we have no clues to the function of these genes. This is critical, as obtaining even some clue to the function of the 40% of functionally uncharacterized proteins in model organism genomes can have significant impact in a broad variety of areas, e.g. drug, antibody and vaccine design, agronomic trait improvement, biochemical engineering, protein design and even nanotechnology. The key impact of our research is that our web portal seeks to serve as a one-stop source on proteins, their organization in the form of domains, and their myriad biological, biochemical and cellular functions. Given that our web portal will be an integrated resource, it has the potential to be a great teaching resource for undergraduate and graduate courses in genome biology, protein science and evolutionary biology. Since this project aims to use the latest advances in computing hardware through the use of Graphics Processing Units, the staff working on this project will take with them transferable skills that are much sought after in academia and industry. For staff interested in continuing in academia, the dissemination of the research through peer-reviewed journal articles will help them attain their career goals in the form of faculty or postdoc positions.
Committee
Research Committee C (Genes, development and STEM approaches to biology)
Research Topics
Structural Biology, Technology and Methods Development
Research Priority
X – Research Priority information not available
Research Initiative
Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme
X – not Funded via a specific Funding Scheme