Award details

3D-Proteomics: FAIRification of proteomics data for comprehensive integration with structural biology information

Reference BB/V018779/1
Principal Investigator / Supervisor Dr Juan Antonio Vizcaino
Co-Investigators / Co-Supervisors Dr Sameer Velankar
Institution EMBL - European Bioinformatics Institute
Department OMICs
Funding type Research
Value (£) 701,511
Status Current
Type Research Grant
Start date 19/04/2022
End date 18/04/2025
Duration 36 months

Abstract

Structural biology is one field where proteomics techniques are having an increasing impact. At the interface between proteomics and structural biology, cross-linking mass spectrometry (CL-MS) is the most popular and mature approach. Because it complements established structural methods, CL-MS has gained popularity in the structural biology community. The PRIDE database has become by far the world-leading resource, currently storing >85% of proteomics datasets worldwide. PRIDE stores >19,000 datasets, with ~1,000 (~5.4%, Nov 2020) coming from CL-MS, Hydrogen Deuterium eXchange (HDX-MS) and other MS-proteomics techniques. At present, PRIDE cannot handle the integration, access and visualisation of CL-MS data in the same way as datasets exported from standard proteomics workflows (CL-MS datasets are therefore labelled as "partial" submissions). During the last year, two related community white papers have been published which summarise the conclusions of a series of community meetings. These two white papers form the basis for the main objectives of "3D-Proteomics". The first white paper calls for the integration of PDB with federated data resources in other fields (explicitly mentioning PRIDE for proteomics data) (WP3 in this proposal), for example to better support integrative modelling approaches, with CL-MS being one of the most prominent use cases that should be supported. The second white paper highlights the need to develop appropriate data standards (WP1) and software tools for CL-MS data (WP2), and to improve data deposition (WP3) and data access and visualisation (WP4), in line with other MS-based proteomics approaches. The two white papers clearly demonstrate the need and demand for the outputs of "3D-Proteomics". As an overall result, CL-MS data will be made 'FAIR-er' (Findable, Accessible, Interoperable and Reusable). Additionally, we will standardise the representation of post-translational modifications in both PDB and PDBe-Knowledge-Base (PDBe-KB).

Summary

Proteins are molecules found in all living organisms that provide structure and carry out most of the important functions in a cell, including catalysing (causing or speeding up) chemical reactions and signalling between different cells. Proteomics is the study of the entire set of proteins in a given biological sample such as a cell or an organism like a bacterium, plant or human. Since proteins are essential for so many crucial functions, proteomics can tell us a lot about how organisms work and also about what happens in illnesses, as well as helping to identify potential treatments. This means that proteomics is used across many areas of beneficial biological and biomedical research. Currently the primary technology used in proteomics is a technique called mass spectrometry (MS), which works by breaking up a protein into small fragments, sorting them and then reporting their mass. The quantity and identity of the protein can then be determined using different software tools. The structure of a protein is also very important, as the way that a protein is organised via folding will help it to carry out its job. The structure also determines how it is able to interact with other proteins; for example, a protein that transports another protein around a cell needs to have a part that binds to it specifically. Protein structure can be studied using techniques like x-ray crystallography, which makes use of the way that different structures diffract (bend) x-rays. A more recent development called cross-linking MS (CL-MS) is a powerful tool for visualising how proteins fold and join together, and it works by running MS on proteins that are linked by specialised chemical reagents called cross-linkers. Unfortunately, CL-MS does not yet have coordinated, mature open standards, and existing datasets are not well linked to other information about protein structure. 
This means that it is difficult to compare and integrate findings between research groups and that important knowledge may be missed. It is important that proteomics databases follow the FAIR principles of being easy to find (Findable), free and open source (Accessible), easily shared and processed (Interoperable) and Reusable. Our research groups manage two world-leading databases: the PRoteomics IDEntifications database (PRIDE), which is a repository for proteomics data generated using MS, and the Protein Data Bank (PDB), which is home to 3D structural data for large molecules including proteins. This project will combine these tools with our expertise in CL-MS in order to develop FAIR data standards and software so that proteomics data generated using CL-MS has a common format and processing pipeline, and so that a suite of software tools is made available in order to process and analyse the data freely and easily. PRIDE will be extended to include these standardised CL-MS data formats, and key software tools for data deposition and visualisation will be made available. As a key point, we will create links between PRIDE and PDB in order to allow for joined-up examination of structural data, including integration between the PDB and PRIDE submission systems. This will mean that researchers will be able to more easily analyse proteins and identify links between their research and other projects, even if they don't have access to CL-MS equipment themselves. The tools and standards that will be generated by this project will benefit researchers across a wide range of biological and biomedical fields, and will provide an interface between proteomics and structural biology information that will enhance and connect research findings. The software will ensure that important and novel structural proteomics data are made accessible and findable, and the standards will maintain its interoperability and reusability. 
We will make sure that our work is disseminated widely and we will deliver workshops to train and assist researchers in making full use of these valuable resources.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Structural Biology
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme