Award details

Expanding Genome3D and disseminating the structural annotations via InterPro and PDBe

Reference BB/N019253/1
Principal Investigator / Supervisor Professor Christine Orengo
Co-Investigators / Co-Supervisors Dr Ian Sillitoe
Institution University College London
Department Structural Molecular Biology
Funding type Research
Value (£) 386,521
Status Completed
Type Research Grant
Start date 01/07/2016
End date 30/06/2019
Duration 36 months

Abstract

1. Improve SCOP/CATH mapping to increase structural data integrated in InterPro

We have already developed a protocol which identifies domain residue ranges for a given PDB structure classified in CATH or SCOP. The overlap between ranges is calculated to determine whether domains are equivalent. Two superfamilies are judged equivalent depending on the percentage of equivalent domains. Recent work by the PDBe in a joint project with CATH - ending November 2015 - has explored more sophisticated approaches. These examine the multi-domain contexts of the domains being compared and identify blocks of equivalent multi-domain architectures between two superfamilies. This work will be further developed to increase the number of equivalent superfamilies. The SCOP/CATH mapping will be exploited in new protocols for integrating predicted structural data into InterPro.

2. Develop a 3D viewer to view sequence variations in a structural context

Displaying structural data in a way that works reliably on different web browsers is a challenge - especially when the display has additional components (e.g. features to show conserved positions). Whilst 3D models can be viewed on the Genome3D website using the JSmol viewer, structures are not integrated with sequence data. There are excellent Java-based tools for analysing protein sequence and structure (e.g. JalView, Jmol); however, working with Java in modern web browsers is no longer viable due to security concerns. Additionally, since JSmol is a direct port from the large Java codebase of Jmol, it carries significant limitations for future development, such as a large web footprint. However, alternatives are being developed that address these limitations. We will evaluate the available 3D molecular viewers, identifying robust candidates that conform to web standards (HTML5/WebGL). The chosen viewer will be integrated with other JavaScript components to provide an intuitive, interactive and reusable structural feature viewer.
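
The residue-range overlap test described under aim 1 can be illustrated with a minimal sketch. This is not the project's actual protocol: the function names, the choice to normalise overlap by the larger domain, the requirement that both directions pass, and the two 0.8 thresholds are all assumptions made for illustration, and the chain identifiers in the usage note are invented.

```python
from typing import List, Tuple

Segment = Tuple[int, int]            # inclusive residue range (start, end)
Domain = Tuple[str, List[Segment]]   # (chain identifier, segments of one domain)


def residues(segments: List[Segment]) -> set:
    """Expand a list of inclusive residue ranges into a set of positions."""
    covered = set()
    for start, end in segments:
        covered.update(range(start, end + 1))
    return covered


def overlap_fraction(a: List[Segment], b: List[Segment]) -> float:
    """Shared residues divided by the size of the larger domain (assumed metric)."""
    ra, rb = residues(a), residues(b)
    if not ra or not rb:
        return 0.0
    return len(ra & rb) / max(len(ra), len(rb))


def domains_equivalent(a: List[Segment], b: List[Segment],
                       min_overlap: float = 0.8) -> bool:
    """Two domains count as equivalent above an assumed overlap threshold."""
    return overlap_fraction(a, b) >= min_overlap


def equivalent_fraction(sf_a: List[Domain], sf_b: List[Domain],
                        min_overlap: float = 0.8) -> float:
    """Fraction of domains in sf_a with an equivalent domain on the same chain in sf_b."""
    if not sf_a:
        return 0.0
    matched = sum(
        any(chain_a == chain_b and domains_equivalent(seg_a, seg_b, min_overlap)
            for chain_b, seg_b in sf_b)
        for chain_a, seg_a in sf_a
    )
    return matched / len(sf_a)


def superfamilies_equivalent(sf_a: List[Domain], sf_b: List[Domain],
                             min_overlap: float = 0.8,
                             min_fraction: float = 0.8) -> bool:
    """Judge two superfamilies equivalent if enough domains match in both directions."""
    forward = equivalent_fraction(sf_a, sf_b, min_overlap)
    backward = equivalent_fraction(sf_b, sf_a, min_overlap)
    return min(forward, backward) >= min_fraction
```

For example, with a hypothetical CATH-style superfamily `[("1abcA", [(1, 100)]), ("2xyzA", [(10, 60), (80, 120)])]` and a SCOP-style superfamily `[("1abcA", [(5, 100)]), ("2xyzA", [(10, 120)])]`, every domain finds a partner above the assumed threshold, so `superfamilies_equivalent` returns `True`. The multi-domain architecture comparison mentioned in the abstract is not covered by this sketch.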

Summary

The structure of a protein dictates the manner in which it interacts with other proteins and whether or how it binds and changes the compounds it is exposed to. Knowing a protein's structure can help rationalise the mechanism by which it performs its biological role. It is also important for understanding how genetic changes, such as mutations in the residues that make up the protein, can destroy or modify the way in which it performs that role. Revolutionary new technologies in biology, known as next generation sequencing, are now allowing biologists to collect vast amounts of genetic variation data. Examples include information on changes in the sequences of proteins collected from humans suffering from diseases such as cancer or heart disease, and sequences of proteins from species important in an agricultural context, such as different strains of wheat that may be more resistant to frost or produce higher yields. However, it is much harder and more expensive to determine the 3D structure of a protein than its sequence. It is particularly difficult for human, mouse, chicken, plants and other eukaryotic organisms that we need to study to understand disease or ensure food security. Currently, on average less than 15% of proteins from these important model organisms have an experimentally determined 3D structure. To address this deficit of structural data, algorithms have been developed for predicting the structure of a protein. The most successful approaches identify a relative having a known structure and inherit 3D information by exploiting the known conservation of structural features between evolutionarily related proteins. Five of the world-leading resources generating such annotations are based in the UK (SUPERFAMILY, Gene3D, PHYRE, FUGUE, pDomTHREADER).
These exploit structural relatives in the SCOP and CATH structural classifications - the two world-leading resources capturing information on domain structures - to use as templates for predicting structures of uncharacterised relatives. The Genome3D resource, which was launched in 2012, integrates domain structure predictions from all five resources for ten model organisms used to study biological systems and important for the study of human health (e.g. human, mouse) or agriculture and food security (e.g. plants). Although the algorithms used by the resources are powerful for recognising even very remote relationships and inheriting structural information between relatives, their accuracy is below 90%. However, by combining all the data in a single resource and identifying positions in the protein where all the methods agree, it is possible to provide much more reliable annotations. Since it is easier to find these consensus regions if equivalent sets of relatives (i.e. families) in SCOP and CATH have been identified, a large part of the project involves mapping between these resources. We now wish to continue this project, improving the mapping of SCOP and CATH and using this to increase the amount of reliable consensus data that Genome3D provides. We will include additional organisms important for health and agriculture. A major benefit from this project will be the integration of the Genome3D structural data with structurally uncharacterised sequences in InterPro, a world-leading resource that combines information on protein families from 11 different resources worldwide. By including Genome3D data for families in InterPro we will be able to increase ten-fold the number of proteins for which we can provide structural data. In addition, we will provide a very intuitive web-based viewer for looking at the structures and assessing the likely impacts of any changes in the sequence on the function of the protein.
Since many biologists are unfamiliar with the value of structural data in assessing genetic variations, we will develop web-based training material and arrange workshops both in our institutes and at international meetings.

Impact Summary

The data provided by the project is essential for a wide range of biologists and this proposal addresses key strategic areas for the BBSRC in Data Driven Bioscience: (1) Improved accuracy of structural data used by structural and computational biologists to analyse protein evolution and predict protein structures and functions; (2) Generation of consensus data that will aid the provision of structural annotations for millions of protein sequences in InterPro, and hence UniProt. Such annotations will be critical for understanding the impacts of genetic variations in these proteins, i.e. variations that could be causing disease in humans or animals or modifying the efficiencies of the proteins in different crop and animal strains. Currently, InterPro contains less than two thirds of the structural annotations in Gene3D and SUPERFAMILY and none of the predictions from PHYRE, FUGUE and pDomTHREADER. By integrating data from all 5 of these Genome3D resources this project will significantly increase the amount of structural data available to biologists. Collaborations between PDBe, SCOP and CATH to map between SCOP and CATH and to develop a platform for assigning domain boundaries to new structures will be incredibly valuable for increasing the numbers of PDB structures classified. Currently <80% of structures in the PDB are classified in either SCOP or CATH and these collaborations will share the task of manual curation - the most time-consuming aspect of the classification.

Dissemination through websites and workshops

As evidenced by the web statistics (CATH and SCOP > 10,0000, InterPro 135,000 and PDBe 45,000 unique users/month), data generated by all resources is widely used by biologists both in academia and in industry. Companies frequently use the resources to determine the structures and functions of query proteins. Recent analyses of web statistics by Genome3D groups showed that ~20% of accesses came from industry.
Furthermore, the algorithms and data provided by FTP downloads are used by a number of pharmaceutical companies including Pfizer India, Cubist and Lilley Pharmaceuticals. In addition to providing information on equivalent superfamilies, the project will provide a range of other consensus data valuable for both academia and industry. For example, consensus data on domain boundary assignments will be highly valuable for structural biologists in pharmaceutical companies to guide the generation of domain constructs for structure determination. We will publicise the SCOP/CATH mapping, consensus data and integration of Genome3D in InterPro by presenting at a Technology track of the annual ISMB conference, which typically has participants from industry. We will hold a Genome3D workshop at UCL in Dec 2018 to present the integration in InterPro and PDBe. Results will also be reported at an EBI workshop at which Orengo regularly presents and at a Bioinformatics course at UCL which is open both to academics and researchers from industry. We will aim to publish in the NAR database issue in 2017 and 2019.

Interaction with the Public

UCL hosts visits by 6th form science students at which the Orengo group give presentations on domain structure classifications and the benefits of using protein structure to understand protein functions and the impacts of genetic variations. UCL is one of 6 Beacons for Public Engagement in the UK and has a dedicated Public Engagement Unit that will provide training. All the PIs have expertise in communicating scientific strategies and discoveries to the public.

Training received by the research project staff

Researchers in all the groups will be working closely together. Researchers will receive hands-on training from other PDRAs in the Orengo, Finn and Velankar groups. All the institutes have excellent training schemes and career development courses and the PDRAs will be working in world-class laboratories of internationally renowned scientists. They will have opportunities to present their work within the groups.

Committee Research Committee D (Molecules, cells and industrial biotechnology)
Research Topics Structural Biology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme