Award details

BBSRC-NSF/BIO. Globally harmonized re-analysis of Data Independent Acquisition (DIA) proteomics datasets enables the creation of new resources

Reference BB/X001911/1
Principal Investigator / Supervisor Dr Juan Antonio Vizcaino
Co-Investigators / Co-Supervisors
Institution EMBL - European Bioinformatics Institute
Department Proteomics
Funding type Research
Value (£) 493,010
Status Current
Type Research Grant
Start date 01/11/2022
End date 31/10/2025
Duration 36 months

Abstract

Proteomics is a key technology for life sciences research, enabling large-scale quantitative measurement of many proteins under different conditions. Recently there has been rapid growth in data independent acquisition (DIA) mass spectrometry (MS) for quantitative proteomics over the more traditional data dependent acquisition (DDA). DIA has the potential to deliver more reproducible measurements, with fewer missing values, but relies on more complex informatics for identifying proteins, often employing previously annotated spectral libraries (SLs). We have a leading role in enabling unified international data deposition and access via the ProteomeXchange (PX) consortium of databases. There are now thousands of raw DIA datasets in the public domain with vast potential value for informing on the biology of the samples analysed. However, the value is mostly locked at present, since the public records are lacking the SLs used to identify proteins, and there is a knowledge gap in how to use public SLs reliably to re-analyse datasets at scale. There is also significant potential for errors in SL construction to be "silent", meaning that incorrect protein identifications in published results cannot be detected and, worse, that errors are falsely replicated when the same SLs are used in multiple studies. In this "DIA-eXchange" project our goal is to unlock the potential in public DIA data by developing database(s), standards and software so that when DIA proteomics datasets are published the SL (and source evidence) is deposited into PX to make DIA proteomics "FAIR"-compliant. This will enable other groups to verify published findings and re-analyse DIA data for new purposes. We will benchmark open-source software and different methods of creating SLs to develop best practice guidelines, and re-analyse hundreds of datasets ourselves, depositing the standardised/uniform protein abundance values in "added value" biologist-focussed databases, namely EMBL-EBI's Expression Atlas.

Summary

Proteins are important molecules that carry out most of the activities that take place in each cell of an organism, such as transporting substances and providing structural support. A proteome is the complete set of all the proteins in a system or organism under certain conditions at a given time, and proteomics is the large-scale study of proteomes. Proteomics applies to many parts of biology as it can tell us a lot about how a system or organism works, and can provide vital information about illnesses and potential treatments. The main technique used in proteomics research is mass spectrometry (MS), which works by breaking up a mixed protein sample into small fragments, sorting them and then reporting their mass. This information is used to determine the identity and amount of the proteins. Recently, an MS approach called data independent acquisition (DIA) has become popular. Traditional MS, called data dependent acquisition (DDA), is biased towards the fragments that have the strongest signal, but DIA is not limited by this. This means that DIA allows researchers to quantify proteins that are present even in very small amounts, allowing for better representation of the proteome. Spectral libraries are collections of pre-annotated experimental MS outputs that are used in DIA data analysis. Recently spectral libraries have been developed using machine learning, which provides a great opportunity for novel artificial intelligence (AI) approaches to proteomics research. Overall, quantitative DIA data is very rich, as it represents a comprehensive digital record of the proteome that can be analysed using different tools and approaches over time. The groups involved in this project have been working to make DIA proteomics data freely available worldwide via the ProteomeXchange (PX) consortium, and to ensure that this data is generated and reported using consistent standards via the Proteomics Standards Initiative (PSI).
This publicly available data provides a great opportunity for researchers to reconfirm original results and obtain new insights. However, there have so far been very limited re-analysis efforts. This may be due to the complex nature of DIA data analysis, and also because of a lack of availability of spectral libraries. Our project aims to address this by generating new knowledge coming from the re-analysis of DIA proteomics datasets and creating novel infrastructure to better support public DIA proteomics data and spectral libraries. Additionally, we will create novel infrastructure for making spectral libraries Findable, Accessible, Interoperable and Re-usable (FAIR), which will enhance the reproducibility of published studies. To achieve these goals we will produce reliable and high-quality protein expression (i.e. protein production) and abundance information from the re-analysis of manually curated public DIA quantitative datasets and we will make these freely available in PX and via EMBL-EBI's Expression Atlas, to be consumed by non-experts in proteomics. We will also create protein co-expression and abundance maps for different biological conditions using the DIA re-analyses and make them available via PX. This would be the first time that these maps are generated from such large amounts of DIA proteomics data, and they will exploit the distinctive strengths of DIA datasets, such as their size and coverage. Further, we will develop novel infrastructure and data standards to make DIA proteomics data and, as a key point, spectral libraries FAIR. This will involve creating open-source tools and infrastructure, and developing PSI standards. The co-expression maps, infrastructure and standards that will be generated by this project will benefit researchers across a wide range of biological and biomedical fields, and will provide the ability to strengthen and connect existing research findings.
We will disseminate our work widely to train and assist researchers in making full use of these valuable resources.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics X – not assigned to a current Research Topic
Research Priority X – Research Priority information not available
Research Initiative UK BBSRC-US NSF/BIO (NSFBIO) [2014]
Funding Scheme X – not Funded via a specific Funding Scheme