Award details

Exploiting High Performance Computing to Provide Functional Annotations via CATH-Gene3D

Reference BB/H02364X/1
Principal Investigator / Supervisor Professor Christine Orengo
Co-Investigators / Co-Supervisors
Institution University College London
Department Structural Molecular Biology
Funding type Research
Value (£) 108,971
Status Completed
Type Research Grant
Start date 01/11/2010
End date 31/10/2011
Duration 12 months

Abstract

The major and technically most challenging part of our project is the porting of GeMMA to publicly available HPC facilities so that it can be run for each CATH-Gene3D release. We will extend our web sites and servers to present GeMMA annotations by using methodologies well established for CATH-Gene3D. We will refine the GeMMA protocol so that it can be ported to multiple public HPC facilities. This will involve modifying the current HPC strategy (which uses local compute clusters) to exploit other, much larger, public services such as:
- the UK National Grid Service (NGS)
- the HECToR supercomputer
- the European grid consortium EGEE (Enabling Grids for E-sciencE)
- the BlueGene facility at Argonne National Laboratory, US
In addition, we will use paid infrastructure-on-demand services such as the Amazon EC2 compute cloud and the corresponding Amazon S3 storage service. While porting the current GeMMA HPC implementation to the systems listed above should be relatively straightforward, the Amazon services will require substantial changes to the protocol. Amazon virtual machines can be 'rented' for weeks or months and used either in a cluster-like scenario resembling the current HPC implementation (e.g. via Sun SGE's new 'cloud adapter' software) or in a purely parallel way, e.g. each running one large superfamily at a time. In either case, scripts have to schedule and monitor the individual processing tasks. We will develop a pipeline which allows us to run GeMMA once or twice a year, i.e. with each release of CATH-Gene3D. Over the last two years, CATH-Gene3D has doubled the number of sequences classified, to ~5 million distinct protein sequences coming from a number of sequence repositories. However, international sequencing efforts, particularly the JGI's GEBA genomes project and the large metagenome initiatives, will lead to even greater expansions of the classification.
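
The "purely parallel" scenario described above, with one superfamily per worker and a script that schedules and monitors the tasks, could be sketched roughly as follows. This is an illustrative sketch only, not the project's pipeline: `process_superfamily`, `run_all`, the superfamily names and the sequence counts are all hypothetical, and the largest-first ordering is an assumption made here so that long-running families do not start last.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_superfamily(name, n_sequences):
    """Hypothetical stand-in for one GeMMA run on a single superfamily."""
    # A real task would launch the clustering job on a rented machine;
    # this placeholder only echoes its input.
    return {"superfamily": name, "sequences": n_sequences}


def run_all(superfamilies, task=process_superfamily, workers=4):
    """Schedule one task per superfamily and record the outcome of each."""
    # Assumption: submit the largest superfamilies first.
    order = sorted(superfamilies.items(), key=lambda kv: kv[1], reverse=True)
    status = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(task, name, size): name for name, size in order}
        # Monitor tasks as they finish and note failures instead of aborting.
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                status[name] = "done"
            except Exception as exc:
                status[name] = f"failed: {exc}"
    return status


if __name__ == "__main__":
    # Made-up superfamily names and sizes, for illustration only.
    demo = {"sf_A": 120000, "sf_B": 45000, "sf_C": 80000}
    for name, state in sorted(run_all(demo).items()):
        print(name, state)
```

Recording a per-task status rather than stopping at the first error means a single failed superfamily can be resubmitted without repeating the whole release run.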

Summary

Over the last ten years there have been intense efforts to determine the protein compositions of different organisms, including human and other model organisms from all kingdoms of life. Currently, more than 1,000 organisms have been completely sequenced and nearly 10 million protein sequences determined. In 2000 the human genome was completed and the latest estimates say it contains between 23,000 and 25,000 protein-coding genes. It is difficult, expensive and time-consuming to determine the functional properties of all these proteins, and for many organisms, including human, fewer than 15% of the proteins have been directly experimentally characterised to determine their function. Therefore, a major activity and challenge for bioinformatics groups has been the need to devise computational methods for inferring the functions of proteins. Most predictive methods exploit the premise that proteins in different species are related to each other (homologues) as they have evolved from a common ancestral protein. These homologous proteins frequently share similar functional properties, conserved during evolution. Therefore, many methods search for similarities in the sequences of proteins, indicative of an evolutionary relationship, which then allows functional information to be inherited. In other words, a protein that has been experimentally characterised in fly, for example, can be used to assign functional properties to an evolutionarily related protein identified in human. The main challenge faced by these approaches is the fact that gene duplication occurs in all organisms throughout evolution. Therefore, as well as the original copy of a protein, derived from an ancestral protein, there can be additional copies which may have evolved slightly modified functions to expand the functional repertoire of the organism, thereby enhancing its survival.
We have developed a resource (CATH-Gene3D) which groups proteins into evolutionary families on the basis of similarities in their 3D structures (where available) and their sequences. Currently, more than 2,200 families are classified in CATH-Gene3D, accounting for the majority of protein domain sequences. Some of these families contain very many sequences as the proteins have been highly duplicated in organisms. These families pose a challenge to function prediction methods as the functions of the relatives have frequently diverged. We have designed a new method (GeMMA) which uses a sophisticated approach for comparing sets of evolutionarily related sequences to group them into subfamilies of proteins which are very likely to share functional properties. Whilst GeMMA has been shown to be accurate in transferring functional information between relatives, it can take a long time to run for the very large families in CATH-Gene3D. Therefore, to speed it up, this project will modify the GeMMA protocol so that we can run it on a wide range of publicly available HPC resources. We will also develop highly intuitive web pages to make the information provided by the GeMMA subfamilies very accessible for the biology community. This web site will also allow biologists to submit a query protein of unknown function, which will then be searched against the GeMMA subfamilies to predict a putative function. CATH-Gene3D is already widely used by biologists and this new functional sub-classification will make the resource even more valuable to these researchers by providing more precise functional annotations for the novel proteins they are studying.

Impact Summary

Communications and Engagement

The modified GeMMA protocol will allow us to provide more accurate functional annotations for all the major protein domain superfamilies in nature. We will disseminate this information relying on our extensive resource and service design expertise:
- Extend the CATH-Gene3D website with new subfamily pages and a subfamily assignment server. Users will be able to submit query sequences for subfamily assignment and investigate functional annotations. The complete GeMMA profile library will also be available for download.
- Distribute GeMMA annotations via the InterPro web site. CATH-Gene3D is one of the InterPro member databases and we regularly provide superfamily HMMs to InterPro, forming an important part of this renowned annotation meta-server. The GeMMA subfamily profiles will become part of this package. InterPro receives nearly 5 million web page accesses per month. CATH-Gene3D annotations are also hosted on the CARGO website for cancer mutations (CNIO, Madrid) and the e-pipe website for splice variants (TU Denmark, Lyngby).
- Provide the annotations through web services. We already supply annotations for CATH-Gene3D superfamilies via the DAS (Distributed Annotation System) Registry at the EBI (http://www.dasregistry.org/). The GeMMA functional annotations will be made available through DAS, the EMBRACE registry and BioCatalogue (http://www.biocatalogue.org/).
The CATH-Gene3D website receives 1 million web hits per month (excluding search engine robots), corresponding to 372,104 page impressions per month from 8,444 unique hosts. CATH-Gene3D is widely used in teaching undergraduate and postgraduate students because of the intuitive presentation of the data. Many other highly accessed sites (e.g. InterPro, PDB, Pfam, PSI-Knowledgebase, PDBsum) provide links to CATH-Gene3D. We will further publicise the new subclassification in workshops, e.g. within IMPACT and ENFIN.

Collaboration

We are involved in several collaborations with experimental groups who will benefit directly from the GeMMA classification:
- Protein Structure Initiative (PSI): This is a large initiative funded by the NIH in the US. We are members of the Midwest Consortium for Structural Genomics (MCSG), one of the four major centres involved in PSI. MCSG includes 8 groups comprising more than 50 structural biologists (with >200 for PSI as a whole), and the data is accessed and exploited by many other scientists involved in similar initiatives. We are currently using GeMMA to identify subfamilies within very highly populated and functionally diverse superfamilies as targets for structure determination. This follows the ultimate aim to represent each of these subfamilies, i.e. functions, by at least one solved structure.
- London Pain Consortium: This Wellcome-funded network includes 7 experimental groups studying neuropathic pain. GeMMA will be used to functionally characterise genes identified by proteomics and microarray studies as being associated with signalling pathways involved in pain.
- EU ENFIN Network of Excellence for Systems Biology: This network pairs computational groups with experimental groups. We are collaborating with several experimental groups working on angiogenesis, mitotic spindle and the PLK1 and LKB1 signalling pathways implicated in cancer. As above, GeMMA functional annotations will be used to characterise genes identified by microarray and proteomics studies.
- Collaborations with Metagenomics Initiatives: As a member of the DOE-funded Centre for Structural Genomics in Infectious Diseases (CSGID) we collaborate with groups analysing metagenome sequences at the J. Craig Venter Institute (JCVI), whose annotators will exploit the GeMMA profiles. The functional repertoire of metagenomic datasets can reveal targets for structure determination, e.g. structural features of subfamilies highly expressed in enterobacterial pathogens could guide drug development.

Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme