Award details

BBSRC-NSF/BIO - Expanding fold library in the twilight zone to facilitate structure determination of macromolecular machines

Reference BB/S017135/1
Principal Investigator / Supervisor Dr Sameer Velankar
Co-Investigators / Co-Supervisors
Institution EMBL - European Bioinformatics Institute
Department MSCB Macromolec, structural and chem bio
Funding type Research
Value (£) 337,453
Status Current
Type Research Grant
Start date 01/11/2019
End date 30/09/2023
Duration 47 months

Abstract

We will significantly expand structural coverage of sequence space by applying a powerful method, Rosetta, to predict structures of novel folds or very remote homologues of known folds. Recent developments in detecting co-evolving, contacting residues exploit vast sequence data and have revolutionised structural biology. The method is also valuable for modelling macromolecular assemblies, because it predicts co-varying residues that form interfaces. The quality of Rosetta models will be improved by using multiple sequence alignments (MSAs) from FunFams, clusters of structurally coherent relatives. FunFams will be vastly expanded with metagenome sequences to increase sequence diversity, giving deeper, more informative MSAs. The Baker group has established a vast library of metagenome sequences from collaborations with the Joint Genome Institute (14522 metagenome sets); the library also includes 27 algal, 92 plant, 772 fungal, 142 worm, 48 bird, 93 insect and 370 eukaryotic genomes from Ensembl and 1915 curated fungal genomes, giving a total of 9 billion sequences. Fast protocols to generate coarse clusters will cope with this vast data by exploiting k-mer hashing, followed by HMM-HMM protocols. Subsequently, the FunFamer algorithm will identify structurally coherent FunFams in each coarse cluster. This exploits HHpred for HMM-HMM comparison and groupsim for SDP detection, and generates multiple sequence alignments using MAFFT. Rosetta-predicted "interface" residues will enhance PISA prediction to identify biological assemblies. We will analyse assemblies annotated in the PDB to validate predictions and use known interface information from IntAct. Predicted structures will be integrated in Genome3D and novel confidence measures will be developed. Novel web visualisations will show known and predicted structures, enabling clear differentiation. Complementing experimental structures in the PDB with predicted models in Genome3D will help the elucidation of large structural complexes by EM and by molecular replacement.
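The coarse clustering step mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the k-mer hashing protocol, so everything here is an assumption made for illustration: the function names (`kmers`, `jaccard`, `coarse_cluster`), the choice of k = 3, the Jaccard similarity measure, the 0.5 threshold and the greedy assignment to the first matching representative. It is not the project's pipeline, which would need far more scalable indexing for billions of sequences.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers of a sequence (stored in a hash set)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def coarse_cluster(seqs: Dict[str, str], k: int = 3,
                   threshold: float = 0.5) -> List[List[str]]:
    """Greedy clustering: each sequence joins the first cluster whose
    representative shares enough k-mers, otherwise it founds a new cluster.
    (Illustrative assumption, not the protocol used in the project.)"""
    representatives: List[Set[str]] = []
    clusters: List[List[str]] = []
    for name, seq in seqs.items():
        profile = kmers(seq, k)
        for index, rep in enumerate(representatives):
            if jaccard(profile, rep) >= threshold:
                clusters[index].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            representatives.append(profile)
            clusters.append([name])
    return clusters


# Hypothetical toy sequences, invented for this example.
example = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",
    "c": "GGSGGSGGSG",
}
clusters = coarse_cluster(example)
```

In this toy input, sequences `a` and `b` differ only in their last residue and fall into one coarse cluster, while `c` shares no 3-mers with them and forms its own. A finer, profile-based step (HMM-HMM comparison in the abstract) would then refine such coarse groups.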

Summary

The Protein Data Bank (PDB) is the single global archive of three-dimensional (3D) structures of large biological molecules. PDBe (pdbe.org) is the European partner in the global consortium managing the PDB. The PDB is one of the oldest biological archives, with 144,000+ entries and nearly 2 million downloads daily by users worldwide in academic or industry settings, working on topics ranging from food security and human health through to the design of more efficient enzymes in various aspects of biotechnology. Despite a steady increase in its holdings (13,000+ entries added in 2017), the growth of the PDB is far outstripped by the growth in the available protein sequence data. Resources like Genome3D (genome3d.eu), funded by the BBSRC, aim to fill the gap in structure coverage of the protein sequence space with reliable predictions of structures. This resource combines data from a number of UK and overseas groups who apply complementary methods for protein structure prediction. These approaches largely model proteins that are closely related to a protein of known structure (i.e. the protein relatives share more than 30% identical residues in their sequences). The Rosetta method for predicting protein structures, a world-leading approach developed by the Baker lab in the USA, was recently enhanced with information derived from evolutionary analyses of protein sequence data, yielding reliable models even for cases where sequence identity between the model and the available experimental structures is very low (below 30%). We will integrate Rosetta models into Genome3D to expand the coverage of structural data for important organisms for health (e.g. human) and food security (e.g. wheat).
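As a small illustration of the 30% sequence identity threshold described above, the sketch below computes percent identity from two pre-aligned sequences. The function names, the use of `-` as the gap character and the choice to count only columns where neither sequence has a gap are assumptions made for this example; other definitions of percent identity use different denominators and give different values.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over alignment columns without gaps.

    Both inputs must come from the same pairwise alignment, so they have
    equal length, with '-' marking gaps (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def is_remote_homologue(aligned_a: str, aligned_b: str,
                        cutoff: float = 30.0) -> bool:
    """True when identity falls below the cutoff (30% in the text above)."""
    return percent_identity(aligned_a, aligned_b) < cutoff


# Hypothetical aligned fragments, invented for this example.
close_pair = percent_identity("ACDE-FG", "ACDQKFG")
remote_pair = is_remote_homologue("AAAA", "AGGG")
```

Here the first pair shares 5 of 6 comparable positions, well above the threshold, while the second pair shares 1 of 4 positions (25%) and would count as a remote relationship under this definition.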
This project will also enrich both the experimentally determined and computationally predicted structures with valuable functional annotations, such as information pertaining to surface interfaces, a key ingredient in understanding how proteins interact with each other and with other biological molecules. By focussing on proteins dissimilar to those with known structures, this portal will help fill the gaps in structure coverage of the protein sequence space and will make structure data much more readily available and accessible. Finally, novel visualisation tools integrating the presentation of the predicted and experimentally determined structures will be developed, maintaining a clear distinction between what is predicted and what is experimentally determined. The expanded set of 3D models derived from this project will in turn help to expand the coverage of sequence space even further, since these models can be used to guide the experimental determination of protein structures by powerful new structural biology techniques like cryo-Electron Microscopy (EM). This project will also endeavour, where possible, to improve the assembly of individual protein structures into macromolecular complexes, which can be analysed to determine their biological role. We anticipate that scientists in both academia and industrial sectors (e.g. pharmaceutical companies) will benefit from access to such an integrated portal, assisting them in designing new medicines, understanding the mechanism of disease, or designing proteins with novel properties. The recent "resolution revolution" in Electron Microscopy allows near-routine determination of structures of large molecular machines, and requires a large repertoire of "building blocks" for interpreting the experimental results, a need which will be partially addressed by the new portal and its provision of expanded domain structure libraries.
The portal will also have ways to access the assembled data programmatically, benefiting power users: software developers and maintainers of other resources.

Impact Summary

The accuracy and reliability of predicted 3D structure models built from close homologues (>50% sequence identity) is clearly demonstrated by their frequent use in X-ray structure determination pipelines as templates for molecular replacement. Recently, powerful new approaches have emerged that allow prediction of reliable 3D structure models for more remote homologues, even below 30% sequence identity, based on predicted residue contacts. These approaches use co-variation information derived from vast amounts of sequence data. The methods also facilitate modelling of molecular assemblies by predicting cross-subunit co-variation of residues forming the assembly interfaces. Many predicted 3D models are not archived in a centralised repository, but a recent BBSRC-funded resource, Genome3D, integrates predicted 3D models built by complementary methods for sequences of important model organisms (e.g. human, mouse, wheat, E. coli). Genome3D is therefore also the obvious home for the models derived using residue co-variation information, and this project will significantly expand Genome3D with accurate models for protein domains that are remote in sequence from known structures and likely to have significant structural novelty. We will build a web portal displaying known and predicted structures together to ensure maximum impact of the experimentally and computationally obtained 3D structure models, and develop appropriate visualisations that allow users to easily distinguish the experimentally determined models and annotations from the computationally derived structure models and predicted annotations. Major beneficiaries of this data will be structural biologists, who will be able to use the expanded library of domain structures for molecular replacement and for interpreting electron microscopy data.
These libraries and associated predictions of interface residues will considerably facilitate the assembly of large macromolecular complexes and thereby provide important insights into the biological role of the proteins. The other major beneficiaries of this new portal will be biologists in academia and industry using the structural data to guide drug design and the design of new proteins. Protein structure data is also key to understanding whether a residue mutation is likely to disrupt the structure or modify the function of a protein. Extensive next generation sequencing projects increasingly reveal these genetic variations (e.g. for different strains of wheat), and biomedical researchers and food biologists will therefore greatly benefit from being able to interpret this variant data in a structural context. In 2017, the structure data in the PDB was downloaded >500 million times by >500K distinct users via the PDBe website (pdbe.org). Genome3D (genome3d.eu) is a relatively new resource with lower exposure, but BBSRC funds integration of the Genome3D data into InterPro, a very highly accessed resource with >90,000 users per month. There will therefore be a large user community of life scientists from academia and industry who will benefit from the availability of these data. In summary, the impact will be realised by:
1. Direct use of the resources by the non-academic sector, such as pharmaceutical companies, who extensively use macromolecular structure models in target identification and design of compounds. The availability of the combined experimental and computational models will also help in the design of modified proteins by the synthetic biology community.
2. The models will also aid interpretation of the impact of disease-specific variants, providing possible molecular explanations for the observed phenotypes.
3. The structural biology community will benefit from access to 3D structure models together with the information on interfaces for interpretation of Electron Microscopy electric potential maps. The models will also serve as search templates for molecular replacement in crystallographic structure determination pipelines.
Committee Research Committee D (Molecules, cells and industrial biotechnology)
Research Topics Structural Biology, Systems Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative UK BBSRC-US NSF/BIO (NSFBIO) [2014]
Funding Scheme X – not Funded via a specific Funding Scheme