BBSRC Portfolio Analyser
Award details
3D-Gateway to protein structure and function
Reference
BB/S020071/1
Principal Investigator / Supervisor
Dr Sameer Velankar
Co-Investigators / Co-Supervisors
Dr Maria J. Martin
Institution
EMBL - European Bioinformatics Institute
Department
MSCB Macromolec, structural and chem bio
Funding type
Research
Value (£)
481,557
Status
Completed
Type
Research Grant
Start date
01/11/2019
End date
31/01/2023
Duration
39 months
Abstract
Despite significant advances in protein structure determination, the majority of proteins have no experimental structural data. Significant improvements in structure prediction methods can fill this gap and provide valuable data for understanding protein functions. Presently, structure data are archived in distinct resources (the PDB for experimental structures, and Genome3D and other specialist resources for predicted models), which impedes access by the wider user community. The 3D-Beacons infrastructure will allow seamless access to all structure models, providing a mechanism for maximising structural coverage of UniProtKB. The 3D-Beacons network will also simplify the comparison of models from different model repositories, allowing development of better confidence measures. Collaboration between resources will ensure the sustainability of the system, and the proposed uniform data access mechanism (REST API) will simplify integration of structural data by other resources such as InterPro and Ensembl and by tools such as Jalview, Chimera and PyMOL, providing an essential foundation for understanding the impacts of genetic variations on protein functions. Access to model structures is also valuable in structure determination and analysis pipelines. We will also develop mechanisms for transferring structure-based functional annotations from the PDBe Knowledgebase to UniProt proteins, and derive a confidence measure for the annotations. This infrastructure will allow integration and display of these annotations for UniProt sequences from key model organisms, including important agricultural organisms and their pathogens. Furthermore, these functional annotations will be built into the UniProt UniRule system, enabling both (i) large-scale annotation of the UniProt KnowledgeBase (UniProtKB) and (ii) their use by other groups annotating completed genomes or metagenome data, through the UniFIRE (UniProt Functional annotation Inference Rule Engine) system.
Summary
Proteins comprise long chains of organic molecules that fold into compact globular 3-dimensional structures. Knowing this structure can give very valuable insights into the clefts, pockets or other surface features important for binding other molecules in the cell, e.g. small molecules or proteins. Knowledge of the structure is also essential for designing drugs that bind to these features and inhibit the protein, and can help in understanding whether mutations in the protein's residues affect its stability or function, leading to disease. Experimentally determining the structure can be challenging, which is why only a small percentage of known proteins (~145,000 out of 120 million) have been characterised. However, powerful computational methods have been developed that predict protein structures by inheriting structural information from evolutionarily related proteins whose structures are known. These prediction techniques have recently been made even more powerful, as new ways of exploiting the evolutionary data have been found that more accurately constrain contacts in the protein. Applying these techniques, structures can be predicted for a large proportion of uncharacterised proteins. For example, for human proteins about 5% of the structures are known but a further 88% can be modelled, some to very high accuracy, thereby providing important frameworks for designing drugs to treat human diseases. When inheriting structural data between distant relatives one has to be much more cautious, and most prediction methods return a confidence score for the models produced. This project will build an infrastructure (3D-Beacons) that aggregates experimentally determined structures with predicted structures generated by groups applying different algorithms. This will be done for proteins from selected organisms relevant to food security and human health; some will be pathogenic bacteria that threaten humans, animals or crops.
We will use this data to annotate proteins in the UniProt resource, which is used by more than 750,000 unique users each month. Since the prediction methods reside in many different labs, by pooling the data in this way we can significantly increase the number of proteins with structural data. In addition, combining models built by independent algorithms allows us to compare 3D-models to find which parts agree regardless of method and which parts vary between methods and are clearly harder to model. Therefore, we will use this aggregated data to research the best strategies for calculating model quality at each position in the protein. We will build web pages to display the known and predicted structures for a given protein. It can be difficult to determine the structure of the whole protein so, where appropriate, we will display both experimental and predicted structures, taking great care to label the structures with information on the source (e.g. method used) and reliability of the data (e.g. confidence). We will also use our 3D-Beacons infrastructure to aggregate information on known and predicted functional sites on the protein structure and display this data on web pages, together with information on source and confidence. The site data mapped onto structure will be particularly helpful for developing rules that allow us to gauge whether a protein with no experimental characterisation has the same function as an evolutionarily related protein with experimental characterisation. Relatives sharing the same function should have the same key functional site residues. With these rules we will be able to provide structural and functional annotations for millions of proteins in UniProt. The new data will represent a tenfold or greater increase in the number of UniProt sequences that have structural and functional site information. UniProt is also widely used by researchers in industry, and thus this expansion in information will have a very significant impact.
Impact Summary
Protein structure data provides valuable insights into the mechanisms by which proteins function and can thus help explain the impacts of genetic variation. It also aids drug design and protein engineering, e.g. for greater stability or higher catalytic efficiency. The impact of structural data is evident from the significant uptake of the data by the community. For example, structural data in PDBe is accessed by >60,000 unique users/month. Genome3D is an integrated resource with structural data from 5 world-leading UK resources, whose sites typically attract 10,000-15,000 users/month. The Genome3D data is also disseminated via InterPro, which has 135,000 unique users/month. Despite significant advances in protein structure determination, a significant proportion of proteins have no experimental structural data in the PDB. However, protein structure prediction methods have improved significantly and the models produced can provide valuable data for understanding protein functions and the impacts of genetic variations. By expanding the predicted structural data in Genome3D and implementing the 3D-Beacons network to integrate additional predicted data from other internationally acclaimed resources (i.e. ModBase, Rosetta, SWISS-MODEL), we will maximise the structural coverage of sequences in UniProtKB and provide valuable data benefitting a very wide community of biologists. As well as aggregating known/predicted structural data, the 3D-Beacons network will aggregate structure-based functional annotations from PDBe-KB. Our 3D-Gateway pages will be carefully designed to display all this information for a given UniProt sequence, in a highly intuitive manner that makes the source of, and confidence in, the data clear. The impact of UniProt in the biological community is extremely high, with access by >750,000 biologists each month.
Our project has clear deliverables likely to have impact on research studies: (1) 3D-Beacons network-based aggregation of structural and functional data will allow individual groups to download aggregated structural data for sets of UniProt proteins. This gives a mechanism for other data providers, e.g. Interactome3D, to combine the data with their information, e.g. on protein interactions and drug targets. (2) Dedicated 3D-Gateway webpages showing structural and functional annotations will provide biologists access to functional information on a protein they are studying. In this context, information provided by UniProt on known disease variants will be enriched by structural and functional information, provided by our 3D-Gateway project, highlighting key residues. (3) The incorporation of structural and functional annotations in UniRules will allow safe transference of annotations to an even greater set of UniProt sequences, and these rules will also be available to genome curators to enable functional annotation and comparative genome studies. Industry will also benefit from the structural and functional annotations of UniProt sequences on the 3D-Gateway pages, to guide drug design. As an activity in the ELIXIR Community of Structural Bioinformatics, 3D-Beacons will make the aggregated data available to groups across Europe and beyond, who in turn will contribute their own data to 3D-Beacons for display on the web pages. This community will also be involved in exploring mechanisms to ensure the quality of the aggregated data (e.g. by highlighting outlying data and developing sound confidence measures). Furthermore, the ELIXIR 3D-BioInfo Community is building links with the ELIXIR Rare Disease and Galaxy Communities to develop workflows for accessing known and predicted structural data.
ELIXIR funding is already supporting development of web-based training workflows in TeSS, by PDBe and Genome3D, for exploiting structural data to gauge the impacts of genetic variations. This training material will ensure wider uptake and exploitation of data from the 3D-Gateway project.
Committee
Research Committee D (Molecules, cells and industrial biotechnology)
Research Topics
Structural Biology, Technology and Methods Development
Research Priority
X – Research Priority information not available
Research Initiative
Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme
X – not Funded via a specific Funding Scheme
Associated awards:
BB/S020144/1 3D-Gateway - Gateway to protein structure and function