Award details

GENOME-3D: UK network providing structure-based annotations for genotype to phenotype studies

Reference BB/I024984/1
Principal Investigator / Supervisor Professor Sir Tom Blundell
Co-Investigators / Co-Supervisors
Institution University of Cambridge
Department Biochemistry
Funding type Research
Value (£) 64,975
Status Completed
Type Research Grant
Start date 01/12/2011
End date 30/11/2012
Duration 12 months

Abstract

We will develop (1) the GENOME-3D website, presenting integrated information from the consortium's resources, and (2) the GENOME-3D webserver, allowing users to submit query sequences or structures to run against the consortium's methods and return consensus predictions.

(1) GENOME-3D website
We will develop SOAP/REST-based web services for:
- Exporting data from individual resources to GENOME-3D, i.e. domain boundaries, superfamily classifications and domain structure predictions
- Combining data, identifying consensus regions and calculating confidence values
We will develop Taverna workflows that plug together the above web services to provide consensus data. We will build a web portal to display this data (see figure 1 main text). The website will exploit an Oracle database and will provide facilities for querying with protein structure ids (PDB ids) or sequence ids (UniProt or GI codes). All partners have extensive experience in web design. CATH-Gene3D has tools for visualising multiple structure/multiple sequence alignments and highlighting conserved residues on representative structures. These will be adopted by GENOME-3D. We will design a questionnaire to capture feedback on the site and use this to improve design.

(2) GENOME-3D webserver
As well as providing predetermined classifications/annotations via the website (some data is manually curated), we will establish a server that allows structure/sequence-based queries and automatically returns consensus domain classifications/predictions (no manual curation). We will develop SOAP/REST-based web services for:
- Scanning query structures against classification methods, i.e. structure comparison (CATHEDRAL) and homologue recognition (HMMscan), to give uncurated SCOP/CATH assignments
- Performing multiple structure alignments
- Scanning query sequences against individual methods predicting domain structures and structural features, e.g. membrane regions
- Generating consensus data from multiple prediction methods
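
The step of combining data, identifying consensus regions and calculating confidence values could look roughly like the sketch below. It is not the consortium's implementation. It assumes a hypothetical input format (method name mapped to a list of 1-based inclusive residue ranges with a family label), assumes each method's domains do not overlap one another, and takes confidence to be the fraction of methods agreeing at a residue, in line with the Summary's description of confidence based on the number of methods that agree. The method names, family labels and the 0.5 threshold are placeholders.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical input: (start, end, family), 1-based inclusive residue range.
Prediction = Tuple[int, int, str]


def per_residue_consensus(
    predictions: Dict[str, List[Prediction]], seq_len: int
) -> List[Tuple[Optional[str], float]]:
    """Return (most common family, fraction of methods agreeing) per residue."""
    n_methods = len(predictions)
    votes = [Counter() for _ in range(seq_len)]
    for domains in predictions.values():
        for start, end, family in domains:
            # Clip each range to the sequence and record one vote per residue.
            for pos in range(max(start, 1), min(end, seq_len) + 1):
                votes[pos - 1][family] += 1
    result: List[Tuple[Optional[str], float]] = []
    for counter in votes:
        if counter:
            # Ties go to the family that was seen first at this residue.
            family, count = counter.most_common(1)[0]
            result.append((family, count / n_methods))
        else:
            result.append((None, 0.0))
    return result


def consensus_regions(
    consensus: List[Tuple[Optional[str], float]], min_confidence: float = 0.5
) -> List[Tuple[int, int, str, float]]:
    """Merge adjacent residues that share a family at or above the threshold.

    Each region is (start, end, family, mean confidence over the region).
    """
    regions: List[Tuple[int, int, str, float]] = []
    current = None  # [start, end, family, summed confidence]

    def close(region) -> Tuple[int, int, str, float]:
        start, end, family, total = region
        return (start, end, family, total / (end - start + 1))

    for pos, (family, conf) in enumerate(consensus, start=1):
        keep = family is not None and conf >= min_confidence
        if keep and current and current[2] == family and current[1] == pos - 1:
            current[1] = pos
            current[3] += conf
        else:
            if current:
                regions.append(close(current))
            current = [pos, pos, family, conf] if keep else None
    if current:
        regions.append(close(current))
    return regions


# Placeholder data: three hypothetical methods on a 10-residue sequence.
example = {
    "method_a": [(1, 5, "fam1")],
    "method_b": [(3, 8, "fam1")],
    "method_c": [(4, 8, "fam2")],
}
consensus = per_residue_consensus(example, seq_len=10)
regions = consensus_regions(consensus, min_confidence=0.5)
print(regions)  # one region: residues 3-5, family "fam1"
```

A simple vote fraction treats all methods as equally reliable; a real service might weight methods differently or handle ties explicitly, which this sketch does not attempt.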

Summary

The 3D structures of proteins are essential to fully characterise the sites mediating their molecular functions and their interactions with other proteins. However, whilst revolutionary technologies have enabled the sequencing of thousands of complete genomes, it is more challenging to determine the 3D structures of the proteins. Although the sequence repositories now contain >10 million protein sequences, fewer than 70,000 protein structures have been determined. Fortunately, in parallel with developments in sequencing technologies, powerful computational methods have emerged to predict the structure of a protein from its sequence. Currently these methods provide putative structures for ~80% of domain sequences from completed genomes, although the accuracy of this data varies from reasonably precise, when structures are modelled using templates based on close relatives, through to quite approximate, for models based on remote relatives and where proteins have no structurally characterised relatives. This project will bring together 6 internationally renowned UK groups involved in (1) classifying protein domains into evolutionary families (as this facilitates structure and function prediction) and/or (2) protein structure prediction. As regards the first activity, classification of protein structures, the two groups involved (SCOP, CATH) are the only groups, worldwide, providing this data. However, each applies somewhat different methodologies to make its assignments. Collaboration between these groups, in GENOME-3D, will involve comparison of domain structures and family classifications, leading to refinements of assignments and/or confidence levels where the methods disagree. Since manual curation of the data is essential and since the rate at which structures are determined is increasing, collaboration will speed up classification by allowing the groups to share information on the more challenging assignments and to discuss outcomes.
For the second activity, structure prediction, the groups involved use technologies that vary in their sensitivity and in their ability to handle large numbers of sequences. Whilst SUPERFAMILY (based on SCOP) and Gene3D (based on CATH) provide greater coverage, they are less likely to recognise very remote homologues, where methods such as GenTHREADER, Phyre and Fugue perform better. For each sequence, we will combine predictions from these different resources and assign a confidence to each residue position in a query sequence based on the number of methods that agree in their structural prediction. We will provide pre-calculated assignments and also allow dynamic queries on the methods. We will also build 3D models for the sequences with residue positions highlighted according to agreement between the methods. We will develop computational platforms that integrate the information provided by each resource. To distribute this data to the biological and medical community we will build a dedicated web site. We will also establish web servers that link the methods, i.e. run all the methods on query sequences and then report consensus assignments and highlight differences. In addition, the consensus classification and annotation data will be provided via two major international sites, the PDBe and InterPro. The sequence repositories are expanding at phenomenal rates as metagenomics and next-generation sequencing initiatives bring in sequences from diverse microbial environments and report sequence variants occurring across different human populations or associated with different disease phenotypes. Structural data will enhance the insights available from this data. For example, known or predicted structures can reveal whether residue mutations oc

Impact Summary

SUMMARY OF RESOURCE
This proposal is to establish a resource (GENOME-3D) for the bioscience and biomedical communities to access an integrated source of information on the 3D structures of proteins and relate this data to protein function. GENOME-3D will consist of information generated by major UK groups in structural bioinformatics. The individual resources are extensively used by the community: combined access to the different databases is over 50,000 visits per month and the total number of jobs run on all the servers is 20,000 per month. This testifies to the importance of this structural and functional information for both the academic and commercial communities. Producing a combined resource will enhance the value of the individual components by enabling comparisons and cross-referencing. The impact of the resource will be extensive and span most of the applications of bioscience and biomedical research. This proposal is endorsed by letters of support from several major UK pharmaceutical, biotech and agricultural companies - Syngenta, UCB, GSK, Isogenica, Heptares, Syntaxin and Astex.

SCIENCE COMMUNITY
Food security - Increasingly the sequences of plants, agricultural pests and agents of disease will be the focus of genome sequencing and structural studies. GENOME-3D will assist in interpreting sequence variations between plant strains and help in identifying the best strain to meet objectives such as yield, water requirements, colour and taste, and resistance to pests and disease. The information could benefit chemical discovery and marker identification for crop breeding programmes.
Bio-energy and bio-industry - The manipulation of individual molecules and of pathways will be central in the exploitation of bioscience to yield new sources of energy and materials. Synthetic pathways can be engineered to make molecules, such as fuels, more efficiently. In addition, novel molecules can be designed and synthesised. Detailed knowledge of the structure of a family of proteins can be used to suggest the critical changes needed to alter function. At the pathway level, GENOME-3D will help to identify the components based on sequence and structural information for families of proteins.
Health - The central role of protein structure in the design of novel and improved pharmaceuticals is well established. Provision of the highest quality 3D models from gene sequence will therefore directly enhance the discovery of new hits. The refinement of these hits into leads will benefit from information about a family of molecules to highlight the relationship between stereochemistry, ligand binding and activity. Therapeutic molecules will span the spectrum from low molecular weight compounds, through peptides, to proteins, including antibodies. A major development over the next few years will be the sequencing of many individuals and relating their sequence variations (single nucleotide polymorphisms, SNPs) to disease susceptibility. This will provide major insights into biological processes in humans, the development of personalised medicine, and the identification of novel drug targets. Central to the interpretation of SNP effects in protein coding regions will be the knowledge available in GENOME-3D of the inter-relationships between protein sequence, structure, function and pathways.

POLICY MAKERS AND THE LAY PUBLIC
GENOME-3D will be an integrated resource with several UK groups working together to develop a world-leading bioinformatics resource. The success of the project could inform policy makers about the value of collaborative work for bioinformatics and other scientific resources within the UK, within Europe and worldwide. Similarly, GENOME-3D can serve as an example to the general public (including schools) that demonstrates both bioinformatics resources and the added value of collaborative research.
Committee Research Committee D (Molecules, cells and industrial biotechnology)
Research Topics Structural Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme