Award details

In silico mass spectrometry for biologists: Tools and resources for next-generation proteomics

Reference BB/P024599/1
Principal Investigator / Supervisor Dr Juan Antonio Vizcaino
Co-Investigators / Co-Supervisors Dr Sarah Butcher, Dr Steven Newhouse
Institution EMBL - European Bioinformatics Institute
Department Proteomics
Funding type Research
Value (£) 444,779
Status Completed
Type Research Grant
Start date 01/12/2017
End date 30/11/2020
Duration 36 months

Abstract

To date, mass spectrometry (MS)-based proteomics has been largely driven by Data-Dependent Acquisition (DDA) approaches, where complex mixtures of peptide analytes are separated via liquid chromatography and elute into the instrument. This approach is limited by instrument throughput and the stochastic sampling of the analyte, leading to under-sampling and poor detection of low abundance proteins. To address such limitations, Data Independent Acquisition (DIA) approaches are gaining popularity, led by SWATH-MS and MSe/HD-MSe. These methods sample the analyte more uniformly and capture richer, deeper data, but generate data sets that are more challenging to interrogate and require sophisticated software solutions. Indeed, the lack of standard tools and the extra expertise required are preventing wider adoption of DIA proteomics approaches. Here, we will develop open analysis pipelines for different DIA techniques using non-commercial software (e.g. OpenSWATH, DIA Umpire, Skyline), and deploy them in the EBI "Embassy Cloud" infrastructure. We will enable easy access to robust and portable pipelines that can also be deployed in other cloud environments, for wider community benefit. In addition, we will extend the functionality of the world-leading proteomics resource (PRIDE Archive at EMBL-EBI) and related tooling, and extend the data standard mzTab to create a common output format for the analysis. A further compelling aspect is the link to PRIDE Archive, which will support construction of robust spectral libraries (from different instruments and species) that can be used by us and our users to conduct novel DIA analyses. This will make good use of the growing DDA and DIA public datasets in PRIDE Archive to extract new knowledge. Novel results will be communicated to the original submitter and the rest of PRIDE Archive users, as well as into three EMBL-EBI resources: Ensembl, UniProt and the Expression Atlas.

Summary

Proteins are the key functional molecules in cells, performing multiple biological tasks. These include catalysing reactions, providing structure to cellular components, signalling between different cells and, as transcription factors, regulating the expression of other genes. The recent advent of genome sequencing, coupled to advances in mass spectrometry (MS) and allied computing techniques, has transformed the study of these molecules into a "Big Data" discipline. This particular branch of the "'omics" is referred to as proteomics: the high-throughput study (identification and, importantly, quantification) of all the proteins that can be detected in a given biological sample. For example, discovering the proteins that are more abundant in different life cycle stages (during development or during ageing) may give us clues as to which biological pathways control these processes. Proteomics is used right across biological and biomedical research for profiling systems as varied as plants, model organisms, infectious diseases/microbes, and chronic diseases of humans and animals, among many others. Currently, the primary technology used in proteomics is MS. Each assay (or scan) in a given MS run (one given experiment) provides us with information about which proteins are present in our samples, by studying the peptides generated from them using a defined enzyme (e.g. trypsin). In the mass spectrometer, each peptide is broken up, and the instrument reports the masses of the different fragments in so-called mass spectra. In the most traditional and most widely used proteomics approaches nowadays, called 'data dependent acquisition' (DDA) techniques, only the most abundant peptides are measured by the instrument, and many of the remaining peptides are simply not detected and/or measured. This leaves the possibility that invaluable biological information, which informs on the relative level of proteins in the cell, is simply missed.
Recently, a novel group of proteomic approaches, known as Data Independent Acquisition (DIA) methods, has started to be used that can overcome some of the limitations of DDA approaches. Excitingly, these methods capture a near-complete digital record of the proteome in that experiment, but require more sophisticated software tools to mine these DIA maps. Relatively few groups are expert in their use, limiting the potential of the community to analyse the growing numbers of DIA data sets. Additionally, the current software tools are not yet robust enough, nor are they available on user-friendly web-based platforms that the average biologist can use. In this project, we will develop and build open software able to analyse proteomics datasets generated using these novel DIA proteomics approaches in a robust manner, so that it can be used in the future by anyone in the community. This will be achieved by making the software available on the European Bioinformatics Institute's "cloud" IT infrastructure. When the project finishes, the generated software pipelines will be ready to be deployed in other similar infrastructures in the UK and internationally. We will also improve and refine current analysis methods by using proteomics data already made available in the public domain, by extending existing collections of mass spectra called spectral libraries. This will support a rich portfolio of (re)analysis methods for the user base, with 'plug and play' components, that also includes support for detection of so-called post-translational modifications (PTMs), which are notoriously difficult to identify otherwise. The project outputs will greatly benefit a wide range of biological and biomedical researchers interested in proteomic techniques for interrogation of samples, even if they don't have access to mass spectrometers. We will ensure this is disseminated by delivering workshops, training and online help/tutorials.

Impact Summary

There is the potential for the following impacts:
- Mass spectrometry vendors (at least SCIEX and Waters) will benefit through the free availability of robust, reliable, reproducible and improved pipelines for the analysis of DIA proteomics datasets. When these pipelines are robust, there will not be the urgency to keep developing their own commercial software solutions, with gains in resources that could be focused on other efforts.
- Software vendors or pharmaceutical research and development teams, since we envisage they may wish to take up our software for local pipelines (e.g. deployed in their own cloud environments). It is important to highlight that all the software developed during the proposal will be open source or at least free-to-use (if the original software used to build the analysis pipelines is not open source). Commercial software will not be part of the developed pipelines.
- Research councils and charities funding research will benefit through the potential for increased impact of the mass spectrometry (MS)-based proteomics projects they fund, thanks to the re-analysis of public DIA proteomics datasets and the integration of novel proteomics data in Ensembl, UniProt and the Expression Atlas.
- Leveraging research partnerships and funding with industry via knowledge exchange and innovation funding has been successfully demonstrated at UoM. We have been fruitful with MRC CiC, P2D, Wellcome Trust ISSF, HEIF, and EPSRC IAA funding streams, which are all aimed at promoting and driving impact. Manchester projects with an MS foundation have always been successful in the life and biomedical sciences, in themselves generating high impact papers and multiple millions of GBP in industry and key stakeholder support.
- There is potential for our infrastructure to assist in clinical biomarker discovery, since DIA based methods (such as SWATH-MS and MSe/HD-MSe) are hugely growing in use in this space, as exemplified by the Stoller Biomarker Discovery Centre Manchester (where some of the applicants are involved).
- More broadly, as proteomics is a key technology in the Life Sciences, there is the potential for considerable indirect benefits on a wide range of areas in basic biology, biomedical or clinical science, as more value will be derived from datasets, including post-translational modifications (PTMs) - key regulators of cell signalling, and thus often studied in the clinical context.
Staff employed will benefit:
- Further training in one key enabling technology for the BBSRC (proteomics) and exposure to conferences, workshops and new national and international collaborations.
- Acquire skills needed to work with bioinformatics software in a cloud environment, something that is getting increasingly important with the growing size of datasets and the need for suitable IT infrastructure.
Committee Research Committee C (Genes, development and STEM approaches to biology)
Research Topics Technology and Methods Development
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme