Award details

Intermediate-to-low resolution feature detection in cryoEM maps using cascaded neural networks

Reference BB/T012064/1
Principal Investigator / Supervisor Dr Martyn Winn
Co-Investigators / Co-Supervisors Dr Agnel Praveen Joseph, Dr Jeyan Thiyagalingam
Institution STFC - Laboratories
Department Scientific Computing Department
Funding type Research
Value (£) 119,358
Status Completed
Type Research Grant
Start date 01/04/2020
End date 31/03/2021
Duration 12 months

Abstract

Cryogenic electron microscopy (cryo-EM) currently enables structure determination of large macromolecular machines at close to atomic resolution. However, 88% of cryo-EM data (in the EM DataBank) are worse than 3.5 Å, and the average resolution currently achieved using single particle analysis is only 5.7 Å. Further developments in the field are likely to bring this number down, but structure interpretability and model validation beyond 3.5 Å remain a clear challenge at the moment. We aim to extend structure interpretability beyond 3.5 Å by generating feature libraries of different size ranges and training machine learning models to detect these features in single particle maps and sub-tomogram averages. We propose to use libraries of structural features at three levels based on spatial extent:
1) 'Secondary structure like' motifs comprising alpha helix, beta strand, polyproline helix, etc., and other frequently occurring small motifs identified by the PDBeMotif database
2) Sub-folds made up of unique arrangements of two secondary structure elements
3) Sub-folds made of more than two secondary structure elements
To generate the last two libraries, we will use a method to segment protein structures based on amino acid contacts, and cluster them by shape. We will develop a machine learning model to recognise these features in maps at different resolutions, using existing fitted models for training and testing. We will test different deep learning architectures to address this problem of multi-label (feature) segmentation. Larger features are composed of a unique arrangement of smaller features, and hence the contextual/neighbourhood information and the hierarchical nature of the features (reflected in a cascaded architecture) are important. An additional network will interpret the output features in terms of overall fold. Once the proof of principle has been tested, this work could be extended to assign sequences and build structural models by assembling the features together.
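As an illustration of what "multi-label segmentation" targets derived from fitted models might look like, the following is a minimal sketch only. It is not the project's method: the feature names in FEATURES, the function label_grid, and the voxel_size and radius parameters are all hypothetical choices made for this example. It marks each voxel with every feature that has a labelled atom nearby, so a single voxel can carry more than one label.

```python
import numpy as np

# Hypothetical label set: the abstract names these feature types,
# but does not specify how they are encoded.
FEATURES = ("helix", "strand")


def label_grid(coords, labels, shape, voxel_size=1.0, radius=2.0):
    """Build a boolean target of shape (len(FEATURES), X, Y, Z).

    coords : (N, 3) atom positions in angstroms, grid origin at (0, 0, 0)
    labels : length-N sequence of feature names, one per atom
    A voxel is marked for a feature if its centre lies within `radius`
    of any atom carrying that feature.
    """
    coords = np.asarray(coords, dtype=float)
    # Coordinates of every voxel centre, shape (X, Y, Z, 3).
    axes = [(np.arange(n) + 0.5) * voxel_size for n in shape]
    centres = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    target = np.zeros((len(FEATURES),) + tuple(shape), dtype=bool)
    for xyz, name in zip(coords, labels):
        dist = np.linalg.norm(centres - xyz, axis=-1)
        target[FEATURES.index(name)] |= dist <= radius
    return target


# Toy input: one "helix" atom and one "strand" atom, 3 angstroms apart.
atoms = [(2.5, 2.5, 2.5), (5.5, 2.5, 2.5)]
target = label_grid(atoms, ["helix", "strand"], shape=(8, 5, 5), radius=2.5)
print(target.shape)  # one channel per feature, then the grid dimensions
```

In a cascaded setup, per-voxel outputs for small features like these could serve as extra input channels for a second stage that detects larger sub-folds; that wiring is an assumption here and is not shown.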

Summary

Understanding the function of biomolecules is fundamental to comprehending how life is sustained and to designing specific therapeutics for diseases associated with their function. Proteins form the largest fraction of cell constituents and often assemble together, and also with other biomolecules, into large molecular machines that perform vital roles in many cellular processes. The three dimensional (3D) structure of a molecular machine forms the platform for its function, and determining the 3D structure is crucial to understanding the details of its activity. Cryogenic electron microscopy (cryo-EM) has had an immense impact on the structure determination of such large molecular assemblies in a near native state. These assemblies can either be studied in isolation (single particle analysis) or in the native cellular environment (electron tomography). Advances in technology and software for cryo-EM have helped to push the level of detail that can be discerned. Nonetheless, intrinsic properties of biological samples often make them less amenable to high resolution structure determination. 88% of cryo-EM structures deposited in the public repository EMDB are worse than 3.5 Å resolution, and therefore don't contain atomic detail. The 3D structural details of a macromolecular assembly are obtained as a density map. Interpreting details of the map requires detection of structural features of the components of the assembly. Intermediate (between 3.5 Å and 6 Å) and low (>6 Å) resolution maps are extremely difficult to interpret using standard automated tools. Available methods usually detect structural features by six-dimensional search procedures that are computationally expensive and are associated with a large number of false positives. Moreover, most of these methods require that the structural details of each component of the assembly are known.
Related problems that go hand in hand are the validation of features derived from low resolution map data and the representation of these low resolution models themselves. The basic structural organization of protein structures and the process of 3D folding from a 1D sequence of amino acids have been studied over several decades. Proteins use a finite set of modular features like secondary structures and folds, and the functional form is made up of a unique arrangement of these features. An intermediate level of features is also observed, where a few secondary structures organize into stable motifs or sub-folds. We plan to exploit the hierarchical feature organization of protein structures and, using powerful deep learning approaches established for pattern recognition, to address the problem of feature recognition in intermediate-to-low resolution maps. We will use structural feature libraries of different sizes, ranging from secondary structures and smaller motifs (e.g. turns of the protein chain) to sub-folds and folds. A specialized set of motifs or sub-folds covering the intermediate size features will be generated based on compactness (contacts). Deep neural network architectures will be designed to detect these 3D structural features in the map, with layers arranged to reflect the structural hierarchy. We also plan to use the developed networks for validation of existing structure models derived from low resolution data. In the future we would like to extend this work to potentially build structural models by assembling the features using additional sequence based information. The developed approach would help to extend structure interpretability at intermediate and low resolutions and make better use of such data to gain insights into the mechanisms of biological function. The proposed development will be implemented as a user-friendly tool and distributed to the scientific community.
We anticipate that other scientific fields could potentially benefit from the machine learning architecture designed for such multi-label 3D segmentation from noisy data.
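The contact-based notion of compactness mentioned in this summary can be illustrated with a small sketch. This is an assumption-laden example, not the segmentation method the project proposes: the functions contact_map and contacts_per_residue, the 8 Å C-alpha cutoff, and the minimum sequence separation of 3 are all hypothetical choices for illustration.

```python
import numpy as np


def contact_map(ca_coords, cutoff=8.0, min_separation=3):
    """Boolean (N, N) matrix of residue contacts.

    Two residues are in contact if their C-alpha atoms lie within
    `cutoff` angstroms and they are at least `min_separation`
    positions apart in sequence (so chain neighbours do not count).
    """
    xyz = np.asarray(ca_coords, dtype=float)
    dist = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    idx = np.arange(len(xyz))
    far_in_sequence = np.abs(idx[:, None] - idx[None, :]) >= min_separation
    return (dist <= cutoff) & far_in_sequence


def contacts_per_residue(contacts):
    """Mean number of contacts per residue, counting each pair once."""
    return np.triu(contacts, k=1).sum() / contacts.shape[0]


# Toy C-alpha traces: an extended chain and a hairpin-like fold.
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]
hairpin = [(0.0, 0, 0), (3.8, 0, 0), (7.6, 0, 0),
           (7.6, 5, 0), (3.8, 5, 0), (0.0, 5, 0)]
print(contacts_per_residue(contact_map(extended)))
print(contacts_per_residue(contact_map(hairpin)))
```

The folded trace scores higher than the extended one, which is the kind of signal a contact-based criterion could use to pick out compact fragments as candidate sub-folds.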

Impact Summary

Understanding the function of biomolecules is fundamental to comprehending how life is sustained and to designing specific therapeutics for diseases associated with their function. Proteins form the largest fraction of cell constituents and often assemble together, and also with other biomolecules, into large molecular machines that perform vital roles in many cellular processes. The three dimensional (3D) structure of a molecular machine forms the platform for its function, and determining the 3D structure is crucial to understanding the details of its activity. In this proposal, we aim to develop a tool which will help experimental structural biologists interpret 3D volumes of molecular machines determined by the techniques of electron cryo-microscopy and tomography. Cryogenic electron microscopy (cryo-EM) has had an immense impact on the structure determination of such large molecular assemblies in a near native state. These assemblies can either be studied in isolation (single particle analysis) or in the native cellular environment (electron tomography). Cryo-EM has been adopted widely by the academic community studying the molecular basis of disease or developing biotechnology applications. The technique has also been adopted in the last couple of years by the pharmaceutical industry and by agritechnology and biotechnology companies for the insight it gives, for example, into particular drug targets. The vast majority of the economic and societal impacts of this work will be achieved indirectly by improving the outputs of these academic and industrial scientists. We will integrate the tool into the software suite of the Collaborative Computational Project for Electron cryo-Microscopy (CCP-EM), which is already used by thousands of structural biologists in academia and industry. CCP-EM also organises and hosts several training workshops per year, and has close links with the electron Bio-Imaging Centre (eBIC) on the Harwell campus.
We will also work closely with our collaborators at the Electron Microscopy and Protein Data Banks to see how the tool can improve the interpretation and annotation of structures already deposited in their databases. This could lead to vital new insights into known structures, and have an impact on the many downstream users of these databases. There could be additional academic or industrial beneficiaries, users of our software, or of our software libraries and algorithms, in domains where 3D pattern recognition from noisy data is required. The PDRA will receive valuable training in the specialist areas of structural biology and machine learning. The role will expose the incumbent to multidisciplinary techniques, and add a valuable skillset to the UK workforce. STFC is active in public engagement activities. STFC has hosted visits from school parties and engages in providing basic scientific exposure and internships to students at school and graduate levels. The imaging from electron microscopy is very visual in nature, and is an excellent focus to help make a connection between science and biology and the everyday world most people experience. Our aim is to raise awareness of science among people who would not otherwise have contact with it, and to inspire school children to take an interest. We have a media officer who targets alerts to the public press or trade publications with topical findings.
Committee Not funded via Committee
Research Topics Structural Biology, Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Tools and Resources Development Fund (TRDF) [2006-2015]
Funding Scheme X – not Funded via a specific Funding Scheme